Reputation: 452
data = {
'date': ['2020-04-27', '2020-04-27', '2020-04-27'],
'user': ['Steeve', 'Pam', 'Olive'],
'mentions': ["['sport', 'basket']", "['politique']", "[]"],
'reply_to': [
"[{'user_id': '123', 'username': 'aaa'}, {'user_id': '234', 'username': 'bbb'}, {'user_id': '456', 'username': 'ccc'}]",
"[{'user_id': '567', 'username': 'zzz'}, {'user_id': '458', 'username': 'vfd'}]",
"[{'user_id': '666', 'username': 'ggg'}]"],
'text': ['textfromSteeve', 'textfromPam', 'textfromOlive']
}
stack = pd.DataFrame(data, columns=['date', 'user','mentions','reply_to','text'])
From this dataframe, I'm trying to convert both mentions
and reply_to
columns into nested list. The goal is then to apply an pandas explode function to display one row for each number of mentions. For instance, I'd like 3 rows of user 'Pam' with one mention for each line (Steeve, Olive and Marc).
So far, I've done the following:
def nested_list(li):
temp = []
for elem in li:
temp.append([elem])
return temp
stack['mentions_nested= stack.mentions.apply(lambda x: nested_list(x))
stack['replies_nested= stack.reply_to.apply(lambda x: nested_list(x))
The problem is when there is only only one name (string) in the column. It splits each letter into a distinct list (ex: [[P], [a], [m]]).
Regarding to the reply_to column, where the dictionary's length is equal to 1, it returns something like this: [[id],[username]].
Do you guys have any idea on how I could do this?
FYI: I'm not going to apply the explode function on both mentions an reply_to columns on the meantime. This is going to be two different process.python
Upvotes: 0
Views: 1637
Reputation: 28644
use json normalize on the reply to column
from pandas import json_normalize
res = pd.concat([json_normalize(ent)
.assign(index=ind)
for ind,ent
in zip(stack.index,stack['reply_to'])
])
res
id username index
0 123 alpha 0
0 231 beta 1
1 4580 omega 1
0 789 olo 2
join back to the original dataframe, with some cleanups
(stack
.join(res.set_index('index'))
.explode('mentions')
.drop('reply_to',axis=1)
)
user mentions text id username
0 Steeve Pam textfromSteeve 123 alpha
1 Pam Steeve textfromPam 231 beta
1 Pam Olive textfromPam 231 beta
1 Pam Marc textfromPam 231 beta
1 Pam Steeve textfromPam 4580 omega
1 Pam Olive textfromPam 4580 omega
1 Pam Marc textfromPam 4580 omega
2 Olive Paul textfromOlive 789 olo
2 Olive Lou textfromOlive 789 olo
Upvotes: 0
Reputation: 862601
I believe you need replace non list values to lists with map
and isinstance
:
for c in ['mentions','reply_to']:
stack[c] = stack[c].map(lambda x: x if isinstance(x, list) else [x])
print (stack)
user mentions \
0 Steeve [Pam]
1 Pam [Steeve, Olive, Marc]
2 Olive [Paul, Lou]
reply_to text
0 [{'id': '123', 'username': 'alpha'}] textfromSteeve
1 [{'id': '231', 'username': 'beta'}, {'id': '45... textfromPam
2 [{'id': '789', 'username': 'olo'}] textfromOlive
Then is possible create dictionaries in list comprehension with assign key from index values, pass to DataFrame
and last use DataFrame.join
for original, last use DataFrame.explode
:
L = [dict(**{'idx':k}, **x) for
k, v in stack.pop('reply_to').items()
for x in v]
df = pd.DataFrame(L).join(stack, on='idx').explode('mentions').reset_index(drop=True)
print (df)
idx id username user mentions text
0 0 123 alpha Steeve Pam textfromSteeve
1 1 231 beta Pam Steeve textfromPam
2 1 231 beta Pam Olive textfromPam
3 1 231 beta Pam Marc textfromPam
4 1 4580 omega Pam Steeve textfromPam
5 1 4580 omega Pam Olive textfromPam
6 1 4580 omega Pam Marc textfromPam
7 2 789 olo Olive Paul textfromOlive
8 2 789 olo Olive Lou textfromOlive
Upvotes: 1