PySpark DataFrame ordering based on multiple columns

How do you order a DataFrame by more than one column at once? Suppose, in addition, that some values are null and should be placed at the end. (See the reference documentation at the end for other options.)

# Assumes an active SparkSession named `spark` (e.g. in the pyspark shell)
df = spark.createDataFrame([('Tom', 80), (None, 60),
                            ('Alice', None), ('Alice', 10),
                            ('Alice', 11), ('Alice', 13)],
                           ["name", "height"])
df.show()

#+-----+------+
#| name|height|
#+-----+------+
#|  Tom|    80|
#| null|    60|
#|Alice|  null|
#|Alice|    10|
#|Alice|    11|
#|Alice|    13|
#+-----+------+

# name ascending, then height descending; nulls go last in both columns
df.orderBy(df.name.asc_nulls_last(),
           df.height.desc_nulls_last()).show()

#+-----+------+
#| name|height|
#+-----+------+
#|Alice|    13|
#|Alice|    11|
#|Alice|    10|
#|Alice|  null|
#|  Tom|    80|
#| null|    60|
#+-----+------+
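
To make the null handling concrete, here is a plain-Python sketch (no Spark needed) that reproduces the same ordering on the same rows. It only illustrates the semantics of "ascending, nulls last" and "descending, nulls last"; it is not how Spark implements the sort, and the names rows, sort_key and ordered are made up for this example.

```python
# Same data as the DataFrame above, as plain tuples (None plays the role of null)
rows = [("Tom", 80), (None, 60),
        ("Alice", None), ("Alice", 10),
        ("Alice", 11), ("Alice", 13)]

def sort_key(row):
    name, height = row
    # (is_null, value): False sorts before True, so nulls end up last
    name_key = (name is None, name or "")
    # Negate the height to get descending order; nulls still end up last
    height_key = (height is None, -(height or 0))
    return (name_key, height_key)

ordered = sorted(rows, key=sort_key)
for row in ordered:
    print(row)
```

Each column contributes a pair to the key: the first element pushes nulls to the end, and the second decides the order among the non-null values.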

Relevant documentation: the PySpark Column API reference, which also covers the asc_nulls_first() and desc_nulls_first() variants.
