Reputation: 1168
I have a dataframe with entries in this format:
user_id,item_list
0,3569 6530 4416 5494 6404 6289 10227 5285 3601 3509 5553 14879 5951 4802 15104 5338 3604 2345 9048 8627
1,16148 8470 7671 8984 9795 6811 3851 3611 7662 5034 5301 6948 5840 345 14652 10729 8429 7295 4949 16144
...
Note that user_id is a regular column, not the index of the dataframe.
I want to transform the dataframe into one that looks like this:
user_id,item_id
0,3569
0,6530
0,4416
0,5494
...
1,4949
1,16144
...
Right now I am trying this but it is wildly inefficient:
import numpy as np
import pandas as pd

df = pd.read_csv("20recs.csv")
numberOfRows = 28107*20
df2 = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('user', 'item'))
iter = 0
for index, row in df.iterrows():
    user = row['user_id']
    itemList = row['item_list']
    items = itemList.split(' ')
    for item in items:
        df2.loc[iter] = [user]+[item]
        iter = iter + 1
As you can see, I even tried pre-allocating the memory for the dataframe but it doesn't seem to help much.
So there must be a much better way to do this. Can anyone help me?
Upvotes: 3
Views: 147
Reputation: 7204
Try this:
df.set_index('user_id').item_list.apply(lambda x: x.split(' ')).explode().reset_index().set_index('user_id')
output
item_list
user_id
0 3569
0 6530
0 4416
0 5494
0 6404
0 6289
0 10227
0 5285
0 3601
0 3509
0 5553
0 14879
0 5951
0 4802
0 15104
0 5338
0 3604
0 2345
0 9048
0 8627
1 16148
1 8470
1 7671
1 8984
1 9795
1 6811
1 3851
1 3611
1 7662
1 5034
1 5301
1 6948
1 5840
1 345
1 14652
1 10729
1 8429
1 7295
1 4949
1 16144
or, if you want user_id as a regular column with a default integer index:
df.set_index('user_id').item_list.apply(lambda x: x.split(' ')).explode().reset_index()
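If you want the column named item_id as in the question, a variant of the same chain might look like the sketch below. It swaps the apply/lambda for the vectorized .str.split accessor and adds a rename; the sample rows are shortened from the question's data, and long_df is just a name I picked for illustration.

```python
import io
import pandas as pd

# Shortened sample in the question's CSV format (assumption: the real file looks like this)
csv_text = "user_id,item_list\n0,3569 6530 4416\n1,16148 8470 7671\n"
df = pd.read_csv(io.StringIO(csv_text))

long_df = (
    df.set_index("user_id")["item_list"]
      .str.split(" ")      # each cell becomes a list of strings
      .explode()           # one row per list element (pandas >= 0.25)
      .rename("item_id")   # name the Series so the column is item_id
      .reset_index()       # user_id back to a regular column
)
print(long_df)
```

The item_id values are still strings at this point; add .astype(int) before reset_index() if you need numbers.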
Upvotes: 1
Reputation: 648
First, turn your item_list column into actual lists. The items in your data are separated by spaces, so split on ' ':
df['item_id_list'] = df['item_list'].str.split(' ')
df['item_id_list_int'] = [[int(i) for i in x] for x in df['item_id_list']]
Then explode it:
df_exp = df.explode('item_id_list_int')
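One thing to be aware of: even when the lists hold ints, explode leaves the exploded column with object dtype and repeats the original index labels. A small sketch of the cleanup, on a made-up two-row frame (the int64 cast and the index reset are my additions, not part of the steps above):

```python
import pandas as pd

# Toy frame standing in for the real data (assumption)
df = pd.DataFrame({"user_id": [0, 1],
                   "item_id_list_int": [[1, 2, 3], [4, 5, 6]]})

df_exp = df.explode("item_id_list_int")
dtype_before = df_exp["item_id_list_int"].dtype   # object, not an integer dtype

# Cast to a real integer dtype and replace the repeated index labels
df_exp["item_id_list_int"] = df_exp["item_id_list_int"].astype("int64")
df_exp = df_exp.reset_index(drop=True)
print(dtype_before, df_exp["item_id_list_int"].dtype)
```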
Upvotes: 1
Reputation: 6642
Use split to transform the space-separated strings into actual lists, then explode to ... well, explode the DataFrame. Requires pandas >= 0.25.0.
>>> df = pd.DataFrame({'user_id': [0,1], 'item_list': ['1 2 3', '4 5 6']})
>>> df
   user_id item_list
0        0     1 2 3
1        1     4 5 6
>>> (df.assign(item_id=df.item_list.apply(lambda x: x.split(' ')))
...    .explode('item_id')[['user_id', 'item_id']])
   user_id item_id
0        0       1
0        0       2
0        0       3
1        1       4
1        1       5
1        1       6
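If you are stuck on a pandas older than 0.25.0 and explode is not available, one possible workaround is to repeat each user_id by the length of its list and flatten the lists yourself. This is a different technique from explode, shown here only as a rough sketch on the same toy frame; lists and long_df are names I made up.

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({"user_id": [0, 1], "item_list": ["1 2 3", "4 5 6"]})

# Split each string into a list of item strings
lists = df["item_list"].str.split(" ")

long_df = pd.DataFrame({
    # repeat each user_id once per item in its list
    "user_id": np.repeat(df["user_id"].values, lists.str.len().values),
    # flatten all lists into one sequence, in row order
    "item_id": list(chain.from_iterable(lists)),
})
print(long_df)
```

Unlike the explode version, this gives a fresh 0..n-1 index rather than repeating the original row labels.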
Upvotes: 1