meanshift clustering using pyspark

Question

We're trying to migrate a vanilla python code-base to pyspark. The agenda is to do some filtering on a dataframe (previously pandas, now spark), then group it by user-ids, and finally apply meanshift clustering on top.

I'm using pandas_udf(df.schema, PandasUDFType.GROUPED_MAP) on the grouped-data. But now there's a problem in the way the final output should be represented.

Let's say we have two columns in the input dataframe, user-id and location. For each user we need to get all clusters (on the location), retain only the biggest one, and then return its attributes, which is a 3-dimensional vector. Let's assume the columns of the 3-tuple are col-1, col-2 and col-3. I can only think of creating the original dataframe with 5 columns, with these 3 fields set to None, using something like withColumn('col-i', lit(None).astype(FloatType())). Then, in the first row for each user, I'm planning to populate these three columns with these attributes. But this seems really ugly way of doing it, and it would unnecessarily waste a lot of space, because apart from the first row, all entries in col-1, col-2 and col-3 would be zero. The output dataframe would look something like below in this case:

+---------+----------+-------+-------+-------+
| user-id | location | col-1 | col-2 | col-3 |
+---------+----------+-------+-------+-------+
| 02751a9 | 0.894956 |  21.9 |  31.5 |  54.1 |
| 02751a9 | 0.811956 |  null |  null |  null |
| 02751a9 | 0.954956 |  null |  null |  null |
|                     ...                    |
| 02751a9 | 0.811956 |  null |  null |  null |
+--------------------------------------------+
| 0af2204 | 0.938011 |  11.1 |  12.3 |  53.3 |
| 0af2204 | 0.878081 |  null |  null |  null |
| 0af2204 | 0.933054 |  null |  null |  null |
| 0af2204 | 0.921342 |  null |  null |  null |
|                     ...                    |
| 0af2204 | 0.978081 |  null |  null |  null |
+--------------------------------------------+

This feels so wrong. Is there an elegant way of doing it?

Bitswazsky · Accepted Answer

What I ended up doing, was grouped the df by user-ids, applied functions.collect_list on the columns, so that each cell contains a list. Now each user has only one row. Then I applied meanshift clustering on each row's data.

meanshift clustering using pyspark

Answers (1)

Related Questions