Reputation: 562
I have a tweeter data set (6,000,000+ rows) and want to extract the conversation threads from it. Let's assume it looks like this:
data = pd.DataFrame({'Tweet_ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
'Reply_to_tweet_ID': [None, '1', '1', '3', None, '1', '4', '2', '4'],
'Max Speed': [380., 370., 24., 26., 584., 48., 8., 123., None]})
Tweet_ID Reply_to_tweet_ID Max Speed
0 1 None 380
1 2 1 370
2 3 1 24
3 4 3 26
4 5 None 584
5 6 1 48
6 7 4 8
7 8 2 123
8 9 4 None
Basically, I should match Reply_to_tweet_ID
with Tweet_ID
and concatenate the results of this match in one row. The results should look something like this:
Tweet_ID1 Reply_to_tweet_ID1 Max Speed1 Tweet_ID2 Reply_to_tweet_ID2 Max Speed2 Tweet_ID3 Reply_to_tweet_ID3 Max Speed3 Tweet_ID4 Reply_to_tweet_ID4 Max Speed4
1 None 380 2 1 370 3 1 24 6 1 48
2 1 370 8 2 123
3 1 24 4 3 26
4 3 26 7 4 8 9 4 None
A similar question has been asked here, but the answer is not true.
My code is:
df = data.set_index(['Reply_to_tweet_ID', data.groupby('Reply_to_tweet_ID')\
.cumcount().add(1)])[['Tweet_ID','Max Speed']]\
.unstack().reset_index()
df.columns = ["{}{}".format(a, b) for a, b in df.columns]
df = df[df.Reply_to_tweet_ID != 'None']
but the result looks like this:
Reply_to_tweet_ID Tweet_ID1 Tweet_ID2 Tweet_ID3 Max Speed1 Max Speed2 Max Speed3
0 1 2 3 6 370 24 48
1 2 8 NaN NaN 123 NaN NaN
2 3 4 NaN NaN 26 NaN NaN
3 4 7 9 NaN 8 None NaN
Upvotes: 1
Views: 918
Reputation: 153460
IIUC, you can do this with a 'self-join' using merge
, then reshaping the dataframe, flatten multiindex column headers and merge
back to original dataframe:
#Create merged dataframe, data_m, to join Reply_to_tweet_ID to Tweet_ID
data_m = data[['Tweet_ID']].merge(data[['Reply_to_tweet_ID','Max Speed','Tweet_ID']],
left_on='Tweet_ID',
right_on='Reply_to_tweet_ID',
suffixes=('','_y'))
#Use `set_index` with `groupby` and `cumcount` then `unstack` to
#reshape long to wide for dataframe, data_u
data_u = data_m.set_index(['Tweet_ID', data_m.groupby('Tweet_ID').cumcount()+1]).unstack()
data_u = data_u.sort_index(axis=1, level=1)
#Flatten multiindex column header using list comprenhension
data_u.columns = [f'{i}{j}' for i, j in data_u.columns]
#merge dataframe, data_u to the orginal dataframe, data
print(data.merge(data_u, on='Tweet_ID'))
Output:
Tweet_ID Reply_to_tweet_ID Max Speed Max Speed1 Reply_to_tweet_ID1 Tweet_ID_y1 Max Speed2 Reply_to_tweet_ID2 Tweet_ID_y2 Max Speed3 Reply_to_tweet_ID3 Tweet_ID_y3
0 1 None 380.0 370.0 1 2 24.0 1 3 48.0 1 6
1 2 1 370.0 123.0 2 8 NaN NaN NaN NaN NaN NaN
2 3 1 24.0 26.0 3 4 NaN NaN NaN NaN NaN NaN
3 4 3 26.0 8.0 4 7 NaN 4 9 NaN NaN NaN
Upvotes: 1