Reputation: 31
I am using Pandas on Spark. I need to group by A and B and then aggregate to return a list of maps, where the keys come from C and the values from D. Sample input:
   A  B          C           D
0  7  201806851  0006378110  2223982011
1  7  6378110    0006378110  2223982011
2  7  201806851  201806851   20972475011
3  7  6378110    201806851   20972475011
Sample output:
   A  B          C
0  7  6378110    [[0006378110, 2223982011], [201806851, 20972475011]]
1  7  201806851  [[0006378110, 2223982011], [201806851, 20972475011]]
This is my code. It's giving the error assert len(key) == len(that_column_labels) AssertionError on the first line. Any ideas?
seed_data["C"] = seed_data[["C", "D"]].to_dict('records')
seed_data = (seed_data
.groupby(["A", "B"])["C"]
.apply(list).reset_index(name="C"))
I tried a few things, like extracting columns C and D into a separate dataframe, converting it to a dict, and then using it as an aggregate column, but I keep getting the assertion error.
Upvotes: 1
Views: 69
Reputation: 71689
This operation can be done efficiently using native Spark functions. In PySpark, group the dataframe, then collect the list of C -> D maps:
import pyspark.sql.functions as F

df.groupBy('A', 'B').agg(F.collect_list(F.create_map('C', 'D')).alias('CD'))
+---+---------+--------------------------------------------------------+
|A |B |CD |
+---+---------+--------------------------------------------------------+
|7 |201806851|[{0006378110 -> 2223982011}, {201806851 -> 20972475011}]|
|7 |6378110 |[{0006378110 -> 2223982011}, {201806851 -> 20972475011}]|
+---+---------+--------------------------------------------------------+
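If you are starting from a pandas-on-Spark DataFrame like seed_data in the question, a minimal sketch of the full round trip could look like this (assuming Spark 3.2+, so that to_spark() and pandas_api() are available; seed_data is the variable name from the question):

import pyspark.sql.functions as F

# Assumed setup: seed_data is the questioner's pandas-on-Spark DataFrame.
sdf = seed_data.to_spark()  # drop down to a native Spark DataFrame

result = (sdf
          .groupBy('A', 'B')
          .agg(F.collect_list(F.create_map('C', 'D')).alias('CD')))

result.show(truncate=False)        # prints a table like the one above
result_psdf = result.pandas_api()  # convert back to pandas-on-Spark if needed

Note that collect_list gathers one single-entry map per input row, so each group ends up with a list of C -> D maps rather than one merged map, which matches the [[key, value], ...] structure the question asks for.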
Upvotes: 0