Pie-ton
Pie-ton

Reputation: 562

How to concatenate rows with the same column value in python?

I have a tweeter data set (6,000,000+ rows) and want to extract the conversation threads from it. Let's assume it looks like this:

data = pd.DataFrame({'Tweet_ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
                   'Reply_to_tweet_ID': [None, '1', '1', '3', None, '1', '4', '2', '4'],
                   'Max Speed': [380., 370., 24., 26., 584., 48., 8., 123., None]})
    Tweet_ID    Reply_to_tweet_ID   Max Speed
0   1           None                380
1   2           1                   370
2   3           1                   24
3   4           3                   26
4   5           None                584
5   6           1                   48
6   7           4                   8
7   8           2                   123
8   9           4                   None

Basically, I should match Reply_to_tweet_ID with Tweet_ID and concatenate the results of this match in one row. The results should look something like this:

Tweet_ID1   Reply_to_tweet_ID1  Max Speed1  Tweet_ID2   Reply_to_tweet_ID2  Max Speed2  Tweet_ID3   Reply_to_tweet_ID3  Max Speed3  Tweet_ID4   Reply_to_tweet_ID4  Max Speed4
1           None                380         2           1                   370         3           1                   24          6           1                   48
2           1                   370         8           2                   123
3           1                   24          4           3                   26
4           3                   26          7           4                   8           9           4                   None

A similar question has been asked here, but the answer is not true.

My code is:

df = data.set_index(['Reply_to_tweet_ID', data.groupby('Reply_to_tweet_ID')\
.cumcount().add(1)])[['Tweet_ID','Max Speed']]\
.unstack().reset_index()

df.columns = ["{}{}".format(a, b) for a, b in df.columns]

df = df[df.Reply_to_tweet_ID != 'None']

but the result looks like this:

    Reply_to_tweet_ID   Tweet_ID1   Tweet_ID2   Tweet_ID3   Max Speed1  Max Speed2  Max Speed3
0                   1           2           3           6          370          24          48
1                   2           8           NaN         NaN        123          NaN                 NaN
2                   3           4           NaN         NaN        26           NaN                 NaN
3                   4           7           9           NaN        8           None                 NaN

Upvotes: 1

Views: 918

Answers (1)

Scott Boston
Scott Boston

Reputation: 153460

IIUC, you can do this with a 'self-join' using merge, then reshaping the dataframe, flatten multiindex column headers and merge back to original dataframe:

#Create merged dataframe, data_m, to join Reply_to_tweet_ID to Tweet_ID
data_m = data[['Tweet_ID']].merge(data[['Reply_to_tweet_ID','Max Speed','Tweet_ID']], 
                                               left_on='Tweet_ID', 
                                               right_on='Reply_to_tweet_ID',
                                              suffixes=('','_y'))

#Use `set_index` with `groupby` and `cumcount` then `unstack` to 
#reshape long to wide for dataframe, data_u
data_u = data_m.set_index(['Tweet_ID', data_m.groupby('Tweet_ID').cumcount()+1]).unstack()
data_u = data_u.sort_index(axis=1, level=1)

#Flatten multiindex column header using list comprenhension    
data_u.columns = [f'{i}{j}' for i, j in data_u.columns]

#merge dataframe, data_u to the orginal dataframe, data
print(data.merge(data_u, on='Tweet_ID'))

Output:

  Tweet_ID Reply_to_tweet_ID  Max Speed  Max Speed1 Reply_to_tweet_ID1 Tweet_ID_y1  Max Speed2 Reply_to_tweet_ID2 Tweet_ID_y2  Max Speed3 Reply_to_tweet_ID3 Tweet_ID_y3
0        1              None      380.0       370.0                  1           2        24.0                  1           3        48.0                  1           6
1        2                 1      370.0       123.0                  2           8         NaN                NaN         NaN         NaN                NaN         NaN
2        3                 1       24.0        26.0                  3           4         NaN                NaN         NaN         NaN                NaN         NaN
3        4                 3       26.0         8.0                  4           7         NaN                  4           9         NaN                NaN         NaN

Upvotes: 1

Related Questions