Reputation: 4698
We're trying to migrate a vanilla Python codebase to PySpark. The plan is to filter a dataframe (previously pandas, now Spark), group it by user-id, and finally apply meanshift clustering to each group.
I'm using pandas_udf(df.schema, PandasUDFType.GROUPED_MAP) on the grouped data, but I'm stuck on how the final output should be represented.
Let's say we have two columns in the input dataframe, user-id and location. For each user we need to get all clusters (on the location), retain only the biggest one, and then return its attributes, which form a 3-dimensional vector. Let's assume the columns of the 3-tuple are col-1, col-2 and col-3. The only thing I can think of is extending the original dataframe to 5 columns, with these 3 fields set to None, using something like withColumn('col-i', lit(None).astype(FloatType())). Then, in the first row for each user, I'm planning to populate these three columns with the attributes. But this seems like a really ugly way of doing it, and it would waste a lot of space, because apart from the first row, all entries in col-1, col-2 and col-3 would be null. The output dataframe would look something like this:
+---------+----------+-------+-------+-------+
| user-id | location | col-1 | col-2 | col-3 |
+---------+----------+-------+-------+-------+
| 02751a9 | 0.894956 | 21.9  | 31.5  | 54.1  |
| 02751a9 | 0.811956 | null  | null  | null  |
| 02751a9 | 0.954956 | null  | null  | null  |
| ...                                        |
| 02751a9 | 0.811956 | null  | null  | null  |
+--------------------------------------------+
| 0af2204 | 0.938011 | 11.1  | 12.3  | 53.3  |
| 0af2204 | 0.878081 | null  | null  | null  |
| 0af2204 | 0.933054 | null  | null  | null  |
| 0af2204 | 0.921342 | null  | null  | null  |
| ...                                        |
| 0af2204 | 0.978081 | null  | null  | null  |
+--------------------------------------------+
This feels so wrong. Is there an elegant way of doing it?
Upvotes: 0
Views: 348
Reputation: 4698
What I ended up doing was grouping the df by user-id and applying functions.collect_list on the columns, so that each cell contains a list. Now each user has only one row. Then I applied meanshift clustering on each row's data.
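For anyone landing here, a rough sketch of what that can look like. It is only an illustration under some assumptions, not the exact code I ran. The helper largest_cluster_summary is a toy 1-D flat-kernel mean shift written in plain Python as a stand-in for a real implementation (for example scikit-learn's MeanShift). The bandwidth default and the three returned values (center, size, spread) are hypothetical placeholders for col-1, col-2 and col-3. The function name summarise_per_user and the intermediate column names locations and summary are made up as well.

```python
from typing import List, Tuple


def largest_cluster_summary(values: List[float], bandwidth: float = 0.05) -> Tuple[float, float, float]:
    """Toy 1-D flat-kernel mean shift (assumes a non-empty list).

    Returns (center, size, spread) of the biggest cluster; these three
    fields are hypothetical placeholders for col-1, col-2 and col-3.
    """
    # Shift each point to the mean of its neighbours until it stops moving.
    modes = []
    for x in values:
        for _ in range(100):
            window = [v for v in values if abs(v - x) <= bandwidth]
            new_x = sum(window) / len(window)
            if abs(new_x - x) < 1e-9:
                break
            x = new_x
        modes.append(x)

    # Points whose modes ended up (almost) in the same place share a cluster.
    clusters = []  # each entry is [mode, members]
    for value, mode in zip(values, modes):
        for cluster in clusters:
            if abs(cluster[0] - mode) <= bandwidth / 2:
                cluster[1].append(value)
                break
        else:
            clusters.append([mode, [value]])

    # Keep only the biggest cluster and describe it with three numbers.
    center, members = max(clusters, key=lambda c: len(c[1]))
    spread = max(members) - min(members)
    return float(center), float(len(members)), float(spread)


def summarise_per_user(df):
    """Collapse df (columns user-id, location) to one row per user."""
    # Imported lazily so the helper above can be tried without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import FloatType, StructField, StructType

    schema = StructType([
        StructField("col-1", FloatType()),
        StructField("col-2", FloatType()),
        StructField("col-3", FloatType()),
    ])
    summary_udf = F.udf(largest_cluster_summary, schema)
    return (
        df.groupBy("user-id")
        .agg(F.collect_list("location").alias("locations"))  # one list per user
        .withColumn("summary", summary_udf("locations"))      # struct of 3 floats
        .select("user-id", "summary.*")                       # user-id, col-1..col-3
    )
```

The result has one row per user with just user-id, col-1, col-2 and col-3, so there is no need to pad the original rows with nulls. Keep in mind that collect_list pulls all of a user's locations into a single row, so this assumes each user's data fits comfortably in memory on one executor.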
Upvotes: 0