How to retain the first row of each ‘group’ in a PySpark DataFrame?

PiyushJuly 8, 2020July 11, 2020Apache Spark, Data Engineering, PySpark

For a given dataframe, with multiple occurrence of a particular column value, one may desire to retain only one (or N number) of those occurrences.

from pyspark.sql.window import Window
from pyspark.sql import Row
from pyspark.sql.functions import *

  
df = sc.parallelize([ \
     Row(name='Bob', age=5, height=80), \
     Row(name='Alice', age=5, height=90), \
     Row(name='Bob', age=5, height=80), \
     Row(name='Alice', age=5, height=75), \
     Row(name='Alice', age=10, height=80)]).toDF()

df.show()

#+---+------+-----+
#|age|height| name|
#+---+------+-----+
#|  5|    80|  Bob|
#|  5|    90|Alice|
#|  5|    80|  Bob|
#|  5|    75|Alice|
#| 10|    80|Alice|
#+---+------+-----+

dropDuplicates can be used when one doesn’t care about which occurrence of the data is to be dropped. eg. if in the above data, you do not care about Alice’s age or height and would desire any one row containing “name=Alice”

df.dropDuplicates(["name"]).show()

#+---+------+-----+
#|age|height| name|
#+---+------+-----+
#|  5|    80|  Bob|
#| 10|    80|Alice|
#+---+------+-----+
#
#______or
#
#+---+------+-----+
#|age|height| name|
#+---+------+-----+
#|  5|    80|  Bob|
#|  5|    75|Alice|
#+---+------+-----+

However, for some reason, if you need to retain only the first occurrence of that column value. Imagine the dataframe is ordered by certain column(s) and thus you cannot afford just any occurrence of the data. Then you have to use window function in spark.
[Relevant imports were already done in first code snippet.]

window = Window.partitionBy("name").orderBy('tiebreak')

df.withColumn('tiebreak', monotonically_increasing_id())\
 .withColumn('rank', rank().over(window))\
 .filter(col('rank') == 1).drop('rank','tiebreak')\
 .show()

#+---+------+-----+
#|age|height| name|
#+---+------+-----+
#|  5|    80|  Bob|
#|  5|    90|Alice|
#+---+------+-----+

Unlike dropDuplicates(), every time the window-rank code method is run, you will get only the first occurrence of the data.
col(‘rank’) == 1 can be updated to any desired value to get top N number of occurrences of that column data.

Published by Piyush

View all posts by Piyush

Leave a comment Cancel reply