PotatoInt
PotatoInt

Reputation: 33

How to merge rows in pandas with similar data

I got data with columns: startpoint, endpoint, data.

I want to merge the startpoint, endpoint rows if they contain the same data (to both directions) and add another columns of the extra data.

for example starting with:

startpoint endpoint data
A B 1
C D 2
B A 3
D C 4

TO:

startpoint endpoint data_1 data_2
A B 1 3
C D 2 4

Is there quick way to do it on pandas?

Thanks.

Upvotes: 3

Views: 208

Answers (4)

sophocles
sophocles

Reputation: 13821

Get matching values values between startpoint and endpoint and do the necessary conversions for your data types. You can then sort the values in your dataframe and use a groupby:

# Matching values between startpoint + endpoint
df['start_end_grouped'] = [sorted(''.join(val).replace(',','')) for val in zip(df['startpoint'], df['endpoint'])]

# Conversions
df['data'] = df['data'].str.replace(',','').astype(float)
df['start_end_grouped'] = df['start_end_grouped'].astype(str)

# Result

df[['data','start_end_grouped']].sort_values(by=['start_end_grouped','data'])\
    .groupby('start_end_grouped',as_index=False).agg(list)

  start_end_grouped        data
0        ['A', 'B']  [1.0, 3.0]
1        ['C', 'D']  [2.0, 4.0]

Upvotes: 0

piterbarg
piterbarg

Reputation: 8219

First we groupby on an index where we sort values in startpoint, endpoint to make sure we get match permutations

match_groups = ['_'.join(sorted(t)) for t in zip(df['startpoint'],df['endpoint'])]
df2 = df.groupby(match_groups, as_index = False).agg({'startpoint':'first', 'endpoint':'first', 'data':list})

df2 looks like this:

    startpoint    endpoint    data
--  ------------  ----------  ------
 0  A             B           [1, 3]
 1  C             D           [2, 4]

if we want data in separate columns then we apply pd.Series (and rename columns to desired labels)

(df2.set_index(['startpoint', 'endpoint'])['data']
    .apply(pd.Series).rename(columns = lambda n:f'data_{n+1}')
    .reset_index()
)

output:

    startpoint    endpoint      data_1    data_2
--  ------------  ----------  --------  --------
 0  A             B                  1         3
 1  C             D                  2         4

Upvotes: 3

Green Noob
Green Noob

Reputation: 408

If I've understood your question correctly, the following code should do what you want -

data.index = [hash(frozenset([x,y])) for x, y in zip(data["startpoint"], data["endpoint"])]
result = data.groupby(data.index)["data"].apply(list).to_frame()
result = result["data"].apply(pd.Series)
result.columns = ["data1", "data2"]
result = pd.merge(data[["startpoint", "endpoint"]], result, left_index=True, right_index=True)
result = result[~result.index.duplicated(keep='first')]

The variable data is the original DataFrame. Please let me know if anything is unclear.

Upvotes: 1

kazi.tanvir
kazi.tanvir

Reputation: 19

Your best bet is to use pd.merge(). Pandas official website shows how to use pd.merge() functions.

https://pandas.pydata.org/docs/dev/user_guide/merging.html

Upvotes: 0

Related Questions