user3242036

Reputation: 705

PySpark GroupBy - Keep Value or Null if No Value

I'm coding in PySpark and have a data frame that contains tokens and their associated phrases. The same phrase can appear in multiple rows, so I want to group by so that there is only one row per phrase, but I want to keep the row that has an associated descriptor. If there is no descriptor, I want to keep one row with the null. Example data set:


+------------------------------------+--------+-------+---------+------------+-----------+
|            SENTENCE                | SENT_ID|  TOKEN| TOKEN_ID|     PHRASE | DESCRIPTOR|
+------------------------------------+--------+-------+---------+------------+-----------+
|The handle of the old razor blade...|       1| handle|        2|      handle|       null|
|The handle of the old razor blade...|       1|  razor|        6| razor blade|       null|
|The handle of the old razor blade...|       1|  blade|        7| razor blade|        old|
+------------------------------------+--------+-------+---------+------------+-----------+

I want it to look like:

+------------------------------------+--------+------------+-----------+
|            SENTENCE                | SENT_ID|     PHRASE | DESCRIPTOR|
+------------------------------------+--------+------------+-----------+
|The handle of the old razor blade...|       1|      handle|       null|
|The handle of the old razor blade...|       1| razor blade|        old|
+------------------------------------+--------+------------+-----------+

There will never be a situation where there are different descriptors for the same phrase. I'm thinking of something like df.groupby('REVIEW_ID','SENT_ID','PHRASE'), but I'm not sure how to bring in the descriptor.
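For reference, a minimal sketch that reconstructs the sample data above (assuming an active SparkSession; the sentences are abbreviated exactly as shown in the table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ("The handle of the old razor blade...", 1, "handle", 2, "handle", None),
    ("The handle of the old razor blade...", 1, "razor", 6, "razor blade", None),
    ("The handle of the old razor blade...", 1, "blade", 7, "razor blade", "old"),
]
cols = ["SENTENCE", "SENT_ID", "TOKEN", "TOKEN_ID", "PHRASE", "DESCRIPTOR"]
df = spark.createDataFrame(data, cols)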

Upvotes: 3

Views: 4513

Answers (1)

notNull

Reputation: 31460

Use the collect_list or collect_set function to get the descriptor values.

  • collect_list/collect_set don't preserve null values; for this case, use when/otherwise to replace null with the string 'null'.

Example:

from pyspark.sql.functions import col, lit, when, isnull, collect_list, collect_set, concat_ws

df.show()
#+---+----+------+
#| id|name|salary|
#+---+----+------+
#|  1|   a|   100|
#|  1|null|   200|
#|  1|null|   300|
#+---+----+------+

#grouping by id and collecting names (collect_list drops the nulls)

df.groupBy("id").agg(collect_list(col("name")).alias("list")).show()
#+---+----+
#| id|list|
#+---+----+
#|  1| [a]|
#+---+----+

#preserve nulls with duplicates (collect_list)
df.groupBy("id").\
agg(concat_ws(",",collect_list(when(isnull(col("name")),lit('null')).otherwise(col("name")))).alias("list")).\
show()
#+---+-----------+
#| id|       list|
#+---+-----------+
#|  1|a,null,null|
#+---+-----------+

#preserve nulls without duplicates (collect_set)
df.groupBy("id").\
agg(concat_ws(",",collect_set(when(isnull(col("name")),lit('null')).otherwise(col("name")))).alias("list")).\
show()
#+---+------+
#| id|  list|
#+---+------+
#|  1|a,null|
#+---+------+
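
Applied to the DataFrame in the question, the same grouping works per phrase. A sketch (column names taken from the question's example; this uses first with ignorenulls=True instead of the string-'null' trick, which keeps the descriptor as a real null when none exists and is safe here since each phrase has at most one distinct descriptor):

from pyspark.sql.functions import col, first

df.groupBy("SENTENCE", "SENT_ID", "PHRASE").\
agg(first(col("DESCRIPTOR"), ignorenulls=True).alias("DESCRIPTOR")).\
show(truncate=False)
#one row per phrase, keeping the descriptor when present and null otherwise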

Upvotes: 3
