daiyue

Reputation: 7448

pandas how to flatten a list in a column while keeping list ids for each element

I have the following df:

 A                                                          id
[ObjectId('5abb6fab81c0')]                                  0
[ObjectId('5abb6fab81c3'),ObjectId('5abb6fab81c4')]         1
[ObjectId('5abb6fab81c2'),ObjectId('5abb6fab81c1')]         2

I would like to flatten each list in A and assign the corresponding id to each element of the list, like this:

 A                               id
 ObjectId('5abb6fab81c0')        0
 ObjectId('5abb6fab81c3')        1
 ObjectId('5abb6fab81c4')        1
 ObjectId('5abb6fab81c2')        2
 ObjectId('5abb6fab81c1')        2

Upvotes: 2

Views: 1788

Answers (3)

vozman

Reputation: 1416

Flattening and unflattening can be done using these functions. Flattening:

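import pandas as pd

def flatten(df, col):
    # One (row label, element) pair per list element, indexed by the original row label.
    col_flat = pd.DataFrame([[i, x] for i, y in df[col].apply(list).items() for x in y], columns=['I', col])
    col_flat = col_flat.set_index('I')
    # Drop the list column, then join the flattened values back on the index.
    df = df.drop(columns=col)
    df = df.merge(col_flat, left_index=True, right_index=True)

    return df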

Unflattening:

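def unflatten(flat_df, col):
    # Group rows that share an index label; collect col back into a list, keep the first value elsewhere.
    return flat_df.groupby(level=0).agg({**{c:'first' for c in flat_df.columns}, col: list})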

After unflattening we get the same dataframe except column order:

(df.sort_index(axis=1) == unflatten(flatten(df, 'A'), 'A').sort_index(axis=1)).all().all()
>> True

To create a unique index, you can call reset_index after flattening.
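As a minimal sketch of that last step: the frame below is hypothetical (plain strings stand in for the ObjectId values) and is built directly in the shape a flattened frame has, with each row label repeated once per list element.

```python
import pandas as pd

# Hypothetical flattened frame: row labels 1 and 2 each appear twice.
flat = pd.DataFrame(
    {"id": [0, 1, 1, 2, 2], "A": ["c0", "c3", "c4", "c2", "c1"]},
    index=[0, 1, 1, 2, 2],
)

# drop=True discards the repeated labels and numbers the rows 0..n-1.
unique = flat.reset_index(drop=True)

print(flat.index.is_unique)    # False
print(unique.index.is_unique)  # True
```

Note that unflatten groups on the index, so it needs the repeated labels; reset the index only when you no longer plan to unflatten.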

Upvotes: 0

erekalper

Reputation: 887

This probably isn't the most elegant solution, but it works. The idea here is to loop through df (which is why this is likely an inefficient solution), and then loop through each list in column A, appending each item and the id to new lists. Those two new lists are then turned into a new DataFrame.

a_list = []
id_list = []
for index, a, i in df.itertuples():
    for item in a:
        a_list.append(item)
        id_list.append(i)
df1 = pd.DataFrame(list(zip(a_list, id_list)), columns=['A', 'id'])

As I said, inelegant, but it gets the job done. There's probably at least one better way to optimize this, but hopefully it gets you moving forward.

EDIT (April 2, 2018)

I had the thought to run a timing comparison between mine and Wen's code, simply out of curiosity. The two variables are the length of column A, and the length of the list entries in column A. I ran a bunch of test cases, iterating by orders of magnitude each time. For example, I started with A length = 10 and ran through to 1,000,000, at each step iterating through randomized A entry list lengths of 1-10, 1-100 ... 1-1,000,000. I found the following:

  • Overall, my code is noticeably faster (especially at increasing A lengths) as long as the list lengths are less than ~1,000. As soon as the randomized list length hits the ~1,000 barrier, Wen's code takes over in speed. This was a huge surprise to me! I fully expected my code to lose every time.
  • Length of column A generally doesn't matter - it simply increases the overall execution time linearly. The only case in which it changed the results was for A length = 10. In that case, no matter the list length, my code ran faster (also strange to me).

Conclusion: If the list entries in A are on the order of a few hundred elements (or fewer) long, my code is the way to go. But if you're working with huge data sets, use Wen's! It's also worth noting that as you hit the 1,000,000 barrier, both methods slow down drastically. I'm using a fairly powerful computer, and each was taking minutes by the end (it actually crashed on the A length = 1,000,000 and list length = 1,000,000 case).
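For anyone who wants to repeat the comparison, here is a rough timing sketch. It is not the script used for the numbers above: make_frame, loop_flatten and stack_flatten are hypothetical helper names, the sizes are kept small, and the stack-based one-liner is assumed to be the version of Wen's code being compared. The extra dropna() is there only so the result does not depend on whether the installed pandas drops the NaN padding inside stack().

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, max_len, seed=0):
    """Hypothetical test data: n_rows lists with random lengths in 1..max_len."""
    rng = np.random.default_rng(seed)
    lengths = rng.integers(1, max_len + 1, size=n_rows)
    return pd.DataFrame({"A": [list(range(k)) for k in lengths], "id": range(n_rows)})


def loop_flatten(df):
    """Nested Python loops over rows and list items."""
    a_list, id_list = [], []
    for _, a, i in df.itertuples():
        for item in a:
            a_list.append(item)
            id_list.append(i)
    return pd.DataFrame({"A": a_list, "id": id_list})


def stack_flatten(df):
    """Widen each list into columns, then stack the columns back into rows."""
    wide = df.set_index("id")["A"].apply(pd.Series)
    return wide.stack().dropna().reset_index().drop(columns="level_1")


df = make_frame(n_rows=200, max_len=20)
for fn in (loop_flatten, stack_flatten):
    seconds = timeit.timeit(lambda: fn(df), number=3)
    print(f"{fn.__name__}: {seconds:.4f} s for 3 runs")
```

Raise n_rows and max_len step by step to look for the crossover described above; absolute timings will vary with hardware and pandas version.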

Upvotes: 2

BENY

Reputation: 323276

I think the comment is coming from this question? You can use my original post or this one:

df.set_index('id').A.apply(pd.Series).stack().reset_index().drop(columns='level_1')
Out[497]: 
   id    0
0   0  1.0
1   1  2.0
2   1  3.0
3   1  4.0
4   2  5.0
5   2  6.0

Or

pd.DataFrame({'id':df.id.repeat(df.A.str.len()),'A':df.A.sum()})
Out[498]: 
   A  id
0  1   0
1  2   1
1  3   1
1  4   1
2  5   2
2  6   2
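On pandas 0.25 or later, DataFrame.explode is another option worth trying. This is a different technique from the two one-liners above, shown as a minimal sketch on hypothetical data (plain strings stand in for the ObjectId values).

```python
import pandas as pd

# Hypothetical data in the same shape as the question's df.
df = pd.DataFrame({"A": [["c0"], ["c3", "c4"], ["c2", "c1"]], "id": [0, 1, 2]})

# explode gives each list element its own row and repeats the other columns;
# reset_index(drop=True) then replaces the repeated row labels with 0..n-1.
flat = df.explode("A").reset_index(drop=True)
print(flat)
```

Unlike apply(pd.Series).stack(), this does not pad shorter lists with NaN, so the values in A are not converted to floats along the way.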

Upvotes: 3
