How to retain the first row of each ‘group’ in a PySpark DataFrame?

For a given dataframe, with multiple occurrence of a particular column value, one may desire to retain only one (or N number) of those occurrences.

from pyspark.sql.window import Window
from pyspark.sql import Row
from pyspark.sql.functions import *

  
df = sc.parallelize([ \
     Row(name='Bob', age=5, height=80), \
     Row(name='Alice', age=5, height=90), \
     Row(name='Bob', age=5, height=80), \
     Row(name='Alice', age=5, height=75), \
     Row(name='Alice', age=10, height=80)]).toDF()

df.show()

#+---+------+-----+
#|age|height| name|
#+---+------+-----+
#|  5|    80|  Bob|
#|  5|    90|Alice|
#|  5|    80|  Bob|
#|  5|    75|Alice|
#| 10|    80|Alice|
#+---+------+-----+

dropDuplicates can be used when one doesn’t care about which occurrence of the data is to be dropped. eg. if in the above data, you do not care about Alice’s age or height and would desire any one row containing “name=Alice”

df.dropDuplicates(["name"]).show()

#+---+------+-----+
#|age|height| name|
#+---+------+-----+
#|  5|    80|  Bob|
#| 10|    80|Alice|
#+---+------+-----+
#
#______or
#
#+---+------+-----+
#|age|height| name|
#+---+------+-----+
#|  5|    80|  Bob|
#|  5|    75|Alice|
#+---+------+-----+

However, for some reason, if you need to retain only the first occurrence of that column value. Imagine the dataframe is ordered by certain column(s) and thus you cannot afford just any occurrence of the data. Then you have to use window function in spark.
[Relevant imports were already done in first code snippet.]

window = Window.partitionBy("name").orderBy('tiebreak')

df.withColumn('tiebreak', monotonically_increasing_id())\
 .withColumn('rank', rank().over(window))\
 .filter(col('rank') == 1).drop('rank','tiebreak')\
 .show()

#+---+------+-----+
#|age|height| name|
#+---+------+-----+
#|  5|    80|  Bob|
#|  5|    90|Alice|
#+---+------+-----+

Unlike dropDuplicates(), every time the window-rank code method is run, you will get only the first occurrence of the data.
col(‘rank’) == 1 can be updated to any desired value to get top N number of occurrences of that column data.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s