Reputation:
Let us say I have this data frame.
df
line  to_line  priority
10    20       1
10    30       1
50    40       3
60    70       2
50    80       3
When rows share the same line and priority values (duplicates, as shown above), I want to combine their to_line values. The desired result should look like the following.
line  to_line  priority
10    20/30    1
50    40/80    3
60    70       2
I tried something like this but I couldn't get what I want.
df.groupBy(col("line")).agg(collect_list(col("to_line")) as "to_line").withColumn("to_line", concat_ws(",", col("to_line")))
Could you please help me figure this out? I appreciate your time and effort.
Upvotes: 1
Views: 1871
Reputation: 2137
You can achieve this with a custom aggregation function.
Code
import pandas as pd

df = pd.DataFrame({
    'line': [10, 10, 50, 60, 50],
    'to_line': [20, 30, 40, 70, 80],
    'priority': [1, 1, 3, 2, 3]
})
# Cast each group's to_line values to str and join them with '/'
array_agg = lambda x: '/'.join(x.astype(str))
grp_df = df.groupby(['line', 'priority']).agg({'to_line': array_agg})
Or, if you do not want the grouped columns to become the index, pass as_index=False to groupby:
grp_df = df.groupby(['line', 'priority'], as_index=False).agg({'to_line': array_agg})
Output (of the first version, where line and priority form the index)
              to_line
line priority
10   1          20/30
50   3          40/80
60   2             70
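For comparison, here is a minimal self-contained sketch of the as_index=False variant, which should keep line and priority as ordinary columns instead of an index. The names join_slash and flat_df are hypothetical, introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'line': [10, 10, 50, 60, 50],
    'to_line': [20, 30, 40, 70, 80],
    'priority': [1, 1, 3, 2, 3],
})

def join_slash(values):
    # Hypothetical helper: cast each value to str and join with '/'
    return '/'.join(values.astype(str))

# as_index=False keeps the grouping keys as regular columns
flat_df = df.groupby(['line', 'priority'], as_index=False).agg({'to_line': join_slash})
print(flat_df)
```

A named function is used here instead of a lambda only for readability; the aggregation itself is the same.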
Upvotes: 5
Reputation: 8033
You can use
df.groupby(['line','priority'])['to_line'].apply(lambda x: '/'.join(str(y) for y in x)).reset_index(name='to_line')
Output
   line  priority to_line
0    10         1   20/30
1    50         3   40/80
2    60         2      70
Upvotes: 0
Reputation: 3930
Maybe something like this:
res = []
# Cast to str so the values can be joined
df['to_line'] = df['to_line'].astype(str)
for line_priority, df_chunk in df.groupby(['line', 'priority']):
    # to_line is str at this point, so this sort is lexicographic
    df_chunk = df_chunk.reset_index().sort_values('to_line')
    to_line = "/".join(df_chunk.to_line.values)
    res.append({'to_line': to_line, 'priority': line_priority[1], 'line': line_priority[0]})
pd.DataFrame(res)
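One thing to watch in the loop above: to_line is cast to str before sorting, so the values are ordered as text ('100' sorts before '20'). If numeric order matters, a possible sketch is to sort first and cast afterwards. The sample values below are made up purely to show the difference and are not from the question.

```python
import pandas as pd

# Made-up sample data chosen so text order and numeric order differ
df = pd.DataFrame({
    'line': [10, 10, 10],
    'to_line': [100, 20, 3],
    'priority': [1, 1, 1],
})

res = (
    df.sort_values('to_line')  # numeric sort while to_line is still an integer
      .groupby(['line', 'priority'])['to_line']
      .agg(lambda s: '/'.join(s.astype(str)))  # cast to str only when joining
      .reset_index()
)
print(res)
```

This relies on groupby keeping the original row order within each group, so the earlier sort carries through to the joined string.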
Upvotes: 0