Reputation: 705
I'm coding in PySpark and have a data frame that has tokens and their associated phrases. The same phrase can appear in multiple rows so I want to groupby so that there is only one row of the phrase, but I want to keep the one that has an associated descriptor. If there is no descriptor, I want to keep one row with the null. Example data set:
+------------------------------------+--------+-------+---------+------------+-----------+
| SENTENCE | SENT_ID| TOKEN| TOKEN_ID| PHRASE | DESCRIPTOR|
+------------------------------------+--------+-------+---------+------------+-----------+
|The handle of the old razor blade...| 1| handle| 2| handle| null|
|The handle of the old razor blade...| 1| razor| 6| razor blade| null|
|The handle of the old razor blade...| 1| blade| 7| razor blade| old|
I want it to look like:
+------------------------------------+--------+------------+-----------+
| SENTENCE | SENT_ID| PHRASE | DESCRIPTOR|
+------------------------------------+--------+------------+-----------+
|The handle of the old razor blade...| 1| handle| null|
|The handle of the old razor blade...| 1| razor blade| old|
There will never be a situation where there are different descriptors for the same phrase. I'm thinking something like
df.groupby('REVIEW_ID','SENT_ID','PHRASE')
but not sure how to bring in the descriptor.
Upvotes: 3
Views: 4513
Reputation: 31460
Use collect_list
or collect_set
functions to get descriptor
values.
collect_list
,collect_set
doesn't preserve null values for this case use when otherwise to replace with string null
.Example:
df.show()
#+---+----+------+
#| id|name|salary|
#+---+----+------+
#| 1| a| 100|
#| 1|null| 200|
#| 1|null| 300|
#+---+----+------+
#grouping by id and collecting names
df.groupBy("id").agg(collect_list(col("name")).alias("list")).show()
#+---+----+
#| id|list|
#+---+----+
#| 1| [a]|
#+---+----+
#preserve nulls without duplicates
df.groupBy("id").\
agg(concat_ws(",",collect_list(when(isnull(col("name")),lit('null')).otherwise(col("name")))).alias("list")).\
show()
#+---+-----------+
#| id| list|
#+---+-----------+
#| 1|a,null,null|
#+---+-----------+
#preserve nulls without duplicates
df.groupBy("id").\
agg(concat_ws(",",collect_set(when(isnull(col("name")),lit('null')).otherwise(col("name")))).alias("list")).\
show()
+---+------+
| id| list|
+---+------+
| 1|a,null|
+---+------+
Upvotes: 3