PySpark DataFrame ordering based on multiple columns

Piyush | July 11, 2020 | July 8, 2020 | Data Engineering, PySpark, Software Playbooks

How do you order a DataFrame by more than one column at once? Suppose, too, that some values are null and should be placed at the end of the ordering. (See the reference documentation at the end for other options.)

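from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses an active session if present
# Sample rows with a null in each column
df = spark.createDataFrame([('Tom', 80), (None, 60),
                            ('Alice', None), ('Alice', 10),
                            ('Alice', 11), ('Alice', 13)],
                           ["name", "height"])
df.show()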

#+-----+------+
#| name|height|
#+-----+------+
#|  Tom|    80|
#| null|    60|
#|Alice|  null|
#|Alice|    10|
#|Alice|    11|
#|Alice|    13|
#+-----+------+

# name ascending, then height descending; nulls go last in both columns
df.orderBy(df.name.asc_nulls_last(),
           df.height.desc_nulls_last()).show()

#+-----+------+
#| name|height|
#+-----+------+
#|Alice|    13|
#|Alice|    11|
#|Alice|    10|
#|Alice|  null|
#|  Tom|    80|
#| null|    60|
#+-----+------+
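
To make the ordering rule concrete, here is a rough plain-Python sketch (no Spark involved) that reproduces the same order with sorted(). It is only an illustration of the semantics, not how Spark implements the sort, and the helper name sort_key is made up for this example.

```python
# Same rows as the DataFrame above, as plain (name, height) tuples.
rows = [("Tom", 80), (None, 60), ("Alice", None),
        ("Alice", 10), ("Alice", 11), ("Alice", 13)]


def sort_key(row):
    """Hypothetical helper: name ascending, height descending, None last."""
    name, height = row
    return (
        # False sorts before True, so non-null names come first.
        (name is None, name or ""),
        # Negating the height turns the ascending sort into a descending one.
        (height is None, -(height or 0)),
    )


ordered = sorted(rows, key=sort_key)
for row in ordered:
    print(row)
```

Each column contributes an (is_null, value) pair to the key, so the null check always wins over the value comparison. That is the same effect asc_nulls_last() and desc_nulls_last() have per column.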

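The relevant documentation is the PySpark API reference for Column.asc_nulls_last and Column.desc_nulls_last; the asc_nulls_first and desc_nulls_first variants are described there as well.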
