Luca Guarro

Reputation: 1168

How to transform a dataframe with a list-valued column into one where each list element becomes its own row

I have a dataframe with entries in this format:

user_id,item_list
0,3569 6530 4416 5494 6404 6289 10227 5285 3601 3509 5553 14879 5951 4802 15104 5338 3604 2345 9048 8627
1,16148 8470 7671 8984 9795 6811 3851 3611 7662 5034 5301 6948 5840 345 14652 10729 8429 7295 4949 16144
...

Note that user_id is a regular column, not the index of the dataframe.

I want to transform the dataframe into one that looks like this:

user_id,item_id
0,3569
0,6530
0,4416
0,5494
...
1,4949
1,16144
...

Right now I am trying this but it is wildly inefficient:

import numpy as np
import pandas as pd

df = pd.read_csv("20recs.csv")
number_of_rows = 28107 * 20
df2 = pd.DataFrame(index=np.arange(0, number_of_rows), columns=('user', 'item'))
row_num = 0  # renamed from `iter`, which shadows the built-in
for index, row in df.iterrows():
    user = row['user_id']
    items = row['item_list'].split(' ')
    for item in items:
        # one .loc assignment per output row
        df2.loc[row_num] = [user, item]
        row_num += 1

As you can see, I even tried pre-allocating the memory for the dataframe but it doesn't seem to help much.

So there must be a much better way to do this. Can anyone help me?

Upvotes: 3

Views: 147

Answers (3)

oppressionslayer

Reputation: 7204

Try this:

df.set_index('user_id')['item_list'].str.split(' ').explode().to_frame()

Output:

        item_list
user_id          
0            3569
0            6530
0            4416
0            5494
0            6404
0            6289
0           10227
0            5285
0            3601
0            3509
0            5553
0           14879
0            5951
0            4802
0           15104
0            5338
0            3604
0            2345
0            9048
0            8627
1           16148
1            8470
1            7671
1            8984
1            9795
1            6811
1            3851
1            3611
1            7662
1            5034
1            5301
1            6948
1            5840
1             345
1           14652
1           10729
1            8429
1            7295
1            4949
1           16144

Or, if you want user_id as a regular column with a default integer index:

df.set_index('user_id')['item_list'].str.split(' ').explode().reset_index()
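
To get exactly the layout asked for in the question (a column named item_id holding integers), the chain above can be extended slightly. This is a minimal sketch on a shortened sample; the rename and the integer cast are additions that assume every entry in item_list is a whole number:

```python
import pandas as pd

# Shortened sample in the same shape as the question's data.
df = pd.DataFrame({
    "user_id": [0, 1],
    "item_list": ["3569 6530 4416", "16148 8470"],
})

result = (
    df.set_index("user_id")["item_list"]
      .str.split()          # split each string on whitespace into a list
      .explode()            # one row per list element, user_id repeated in the index
      .astype(int)          # assumption: every item is an integer
      .rename("item_id")    # name the Series so the column comes out as item_id
      .reset_index()        # move user_id back into a regular column
)
```

Calling str.split() with no argument splits on any run of whitespace, so stray double or trailing spaces in the input do not produce empty items.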

Upvotes: 1

SchwarzeHuhn

Reputation: 648

First, your item_list column should hold actual lists. The values in the question are separated by spaces, so split on a space:

df['item_id_list'] = df['item_list'].str.split(' ')
df['item_id_list_int'] = [[int(i) for i in x] for x in df['item_id_list']]

Then you explode it

df_exp = df.explode('item_id_list_int')
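
One thing to watch for after the two steps above: explode keeps the original row label for every element and leaves the exploded column with object dtype. The sketch below, on a small made-up frame, shows one way to tidy that up; the column name item_id and the cleanup steps are illustrative choices, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [0, 1], "item_list": ["1 2 3", "4 5"]})

# Step 1: turn each space-separated string into a list.
df["item_id"] = df["item_list"].str.split(" ")

# Step 2: one row per list element.
df_exp = df.explode("item_id")

# The original row labels are repeated, once per element.
repeated_labels = df_exp.index.tolist()  # [0, 0, 0, 1, 1]

# Cleanup: drop the source column, cast to int, and rebuild a 0..n-1 index.
df_exp = (
    df_exp.drop(columns="item_list")
          .astype({"item_id": int})
          .reset_index(drop=True)
)
```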

Upvotes: 1

mcsoini

Reputation: 6642

Use split to transform the space-separated strings into actual lists, then explode to ... well, explode the DataFrame. Requires pandas >= 0.25.0.

>>> df = pd.DataFrame({'user_id': [0,1], 'item_list': ['1 2 3', '4 5 6']})
>>> df

   user_id item_list
0        0     1 2 3
1        1     4 5 6

>>> (df.assign(item_id=df.item_list.apply(lambda x: x.split(' ')))
       .explode('item_id')[['user_id', 'item_id']])

   user_id   item_id
0        0         1
0        0         2
0        0         3
1        1         4
1        1         5
1        1         6
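
If you are stuck on a pandas version older than 0.25.0, where explode is not available, the same result can be built by hand. The following is only a sketch of one possible fallback (it is not taken from the pandas docs): repeat each user_id once per item with numpy.repeat and flatten the lists with itertools.chain.

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({"user_id": [0, 1], "item_list": ["1 2 3", "4 5 6"]})

lists = df["item_list"].str.split(" ")   # Series of Python lists
lengths = lists.str.len()                # number of items per row

result = pd.DataFrame({
    # repeat each user_id as many times as its list has items
    "user_id": np.repeat(df["user_id"].values, lengths.values),
    # flatten all lists into one long sequence, in row order
    "item_id": list(chain.from_iterable(lists)),
})
```

The items stay as strings here, as in the explode version; add .astype(int) on the item_id column if you need numbers.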

Upvotes: 1
