How to split the contents of a column into *only two* columns by a delimiter?

While working with PySpark, I came across a requirement where data in a column had to be split on a delimiter in the string. However, there was a caveat: only two child columns were to be created.

This means that even with two or more occurrences of the delimiter, only the first occurrence counts. Everything after the first delimiter goes into the second column.

For example, given this data:

|Animal|           Food|
|   Dog|meat,bread,milk|
|   Cat|     mouse,fish|

the desired result is:

|Animal|           Food|Food1|     Food2|
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
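For intuition, plain Python's `str.split` does exactly this when given a `maxsplit` argument of 1, splitting only on the first delimiter:

```python
# Plain-Python analogue of the requirement: split only on the FIRST comma.
food = "meat,bread,milk"
food1, food2 = food.split(",", 1)  # maxsplit=1 -> at most two pieces
print(food1)  # meat
print(food2)  # bread,milk
```

The question is how to get the same behaviour on a Spark column.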

I approached it with a brute-force method first:
Step 1. Split the string and save everything before the first ‘,’ into Food1.
Step 2. Use regex to save the complete string, minus Food1, into Food2.
Step 3. Finally, delete the first character of Food2 to remove the leftover ‘,’.

from pyspark.sql.functions import col, expr, split

testdf = spark.createDataFrame(
    [("Dog", "meat,bread,milk"), ("Cat", "mouse,fish")], ["Animal", "Food"]
)

(testdf
    .withColumn("Food1", split(col("Food"), ",").getItem(0))       # Step 1: text before the first ','
    .withColumn("Food2", expr("regexp_replace(Food, Food1, '')"))  # Step 2: strip Food1 (Food1 is treated as a regex pattern)
    .withColumn("Food2", expr("substring(Food2, 2)"))              # Step 3: drop the leading ','
    .show())

The task was accomplished, but obviously there had to be a more elegant solution! I headed to Stack Overflow. (Check this link for other solutions!)

Using string functions solved the problem as it should be solved, i.e. without regex or unnecessary extra steps.

(testdf
    .withColumn("Food1", expr("substring(Food, 1, instr(Food, ',') - 1)"))        # text before the first ','
    .withColumn("Food2", expr("substring(Food, instr(Food, ',') + 1, length(Food))"))  # text after the first ','
    .show())
