While working with PySpark, I came across a requirement where data in a column had to be split using a delimiter present in the string. However, there was a caveat! Only two child columns were to be created.
That meant that for two or more occurrences of the delimiter, only the first occurrence would count! The rest of the string after the first occurrence of the delimiter would go into the second column.
e.g. my data:
+------+---------------+
|Animal| Food|
+------+---------------+
| Dog|meat,bread,milk|
| Cat| mouse,fish|
+------+---------------+
should become:
+------+---------------+-----+----------+
|Animal| Food|Food1| Food2|
+------+---------------+-----+----------+
| Dog|meat,bread,milk| meat|bread,milk|
| Cat| mouse,fish|mouse| fish|
+------+---------------+-----+----------+
I approached it with a brute-force method first:
Step 1. Split and save the string before the first ',' into Food1.
Step 2. Use regex to save the complete string, minus Food1, into Food2.
Step 3. Finally, delete the first character of Food2 to deal with the leftover ','.
from pyspark.sql.functions import col, expr, split

testdf = spark.createDataFrame([("Dog", "meat,bread,milk"), ("Cat", "mouse,fish")], ["Animal", "Food"])

# Step 1: text before the first ','; Step 2: strip Food1 out of Food; Step 3: drop the leading ','
testdf.withColumn("Food1", split(col("Food"), ",").getItem(0))\
    .withColumn("Food2", expr("regexp_replace(Food, Food1, '')"))\
    .withColumn("Food2", expr("substring(Food2, 2)")).show()
The task was accomplished, but obviously there had to be a more elegant solution! I headed to Stack Overflow. (Check this link for other solutions!)
Using string functions solved the problem as it should be solved, i.e. without dealing with regex or unnecessary additional steps.
testdf.withColumn("Food1", expr("""substring(Food,1,instr(Food,',')-1)"""))\
.withColumn("Food2", expr("""substring(Food,instr(Food,',')+1,length(Food))""")).show()