Reputation: 73
I have a Pandas DataFrame with a million rows (ids), where one of the columns holds a list of tokens, e.g.
df = pd.DataFrame({'id' : [1,2,3,4] ,'token_list' : [['a','b','c'],['c','d'],['a','e','f'],['c','f']]})
I want to create a dictionary whose keys are all the unique tokens - 'a', 'b', 'c', 'd', 'e', 'f' (which I already have as a separate list) and whose values are the lists of ids each token is associated with. For example, {'a' : [1, 3], 'b': [1], 'c': [1, 2, 4]} and so on.
My problem is that there are 12,000 such tokens, and I do not want to loop through each row of the frame. isin does not seem to work either.
Upvotes: 1
Views: 200
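For reference, the same inversion can be written with DataFrame.explode, which was added in pandas 0.25 and postdates the answers below; a minimal sketch, not part of the original question:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'token_list': [['a', 'b', 'c'], ['c', 'd'],
                                  ['a', 'e', 'f'], ['c', 'f']]})

# Explode the list column so each token gets its own row,
# keeping the id of the row it came from.
exploded = df.explode('token_list')

# Group the ids by token and collect them into lists.
token_to_ids = exploded.groupby('token_list')['id'].apply(list).to_dict()
print(token_to_ids)
# → {'a': [1, 3], 'b': [1], 'c': [1, 2, 4], 'd': [2], 'e': [3], 'f': [3, 4]}
```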
Reputation: 323326
(df.set_index('id')['token_list']
   .apply(pd.Series)
   .stack()
   .reset_index(name='V')
   .groupby('V')['id']
   .apply(list)
   .to_dict())
Out[359]: {'a': [1, 3], 'b': [1], 'c': [1, 2, 4], 'd': [2], 'e': [3], 'f': [3, 4]}
Upvotes: 2
Reputation: 863266
Use np.repeat with numpy.concatenate for flattening first, then groupby with list, and last to_dict:
import numpy as np

# Repeat each id once per token in its list, and flatten the lists of tokens.
a = np.repeat(df['id'], df['token_list'].str.len())
b = np.concatenate(df['token_list'].values)
# Group the repeated ids by the flattened tokens.
d = a.groupby(b).apply(list).to_dict()
print(d)
{'c': [1, 2, 4], 'a': [1, 3], 'b': [1], 'd': [2], 'e': [3], 'f': [3, 4]}
Detail:
print(a)
0 1
0 1
0 1
1 2
1 2
2 3
2 3
2 3
3 4
3 4
Name: id, dtype: int64
print(b)
['a' 'b' 'c' 'c' 'd' 'a' 'e' 'f' 'c' 'f']
Upvotes: 2
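As a point of comparison, the same inversion in plain Python with collections.defaultdict; a single pass over the rows, so it does use an explicit loop the asker wanted to avoid, but it is a simple baseline to check the vectorized answers against (a sketch, not from either answer):

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'token_list': [['a', 'b', 'c'], ['c', 'd'],
                                  ['a', 'e', 'f'], ['c', 'f']]})

# One pass over the rows: append each id under every token it contains.
index = defaultdict(list)
for row_id, tokens in zip(df['id'], df['token_list']):
    for token in tokens:
        index[token].append(row_id)

print(dict(index))
# → {'a': [1, 3], 'b': [1], 'c': [1, 2, 4], 'd': [2], 'e': [3], 'f': [3, 4]}
```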