How to split contents of a column into *only two* by a delimiter?

While working with PySpark, I came across a requirement, where data in a column had to be split using delimiters in the string. However, there was a caveat! Only two child columns were to be created.

Which would mean, for two or more occurrences of the delimiter, only 1st occurrence would count! Rest of the string after 1st occurrence of the delimiter would go to the 2nd column.

eg. by data:

+------+---------------+
|Animal|           Food|
+------+---------------+
|   Dog|meat,bread,milk|
|   Cat|     mouse,fish|
+------+---------------+
+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+

I approached it with brute force method first:
Step1. Splitting and saving the string before first ‘,’ into Food1.
Step2. Using regex to save the complete string except Food1 into Food2.
Step3. Finally, deleting the first character of Food2 column to deal with an unnecessary ‘,’.

testdf= spark.createDataFrame([("Dog", "meat,bread,milk"), ("Cat", "mouse,fish")],["Animal", "Food"])


testdf.withColumn("Food1", split(col("Food"), ",").getItem(0))\
        .withColumn("Food2",expr("regexp_replace(Food, Food1, '')"))\
        .withColumn("Food2",expr("substring(Food2, 2)")).show()

The task was accomplished, but obviously there had to be an elegant solution! I headed to stackoverflow. (Check this link for other solutions!)

Using string functions solved the problem as it should be, i.e without dealing with regex or unnecessary additional steps.

testdf.withColumn("Food1", expr("""substring(Food,1,instr(Food,',')-1)"""))\
      .withColumn("Food2", expr("""substring(Food,instr(Food,',')+1,length(Food))""")).show()

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s