Wouter

Reputation: 11

List of lists into dataframe in pandas

I have a list of lists that I want to turn into a dataframe, keeping their index in the original list as well.

x = [["a", "b", "c"], ["A", "B"], ["AA", "BB", "CC"]]

I can do this with a for loop like this:

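import pandas as pd

result = []
for i, row in enumerate(x):
    # One small frame per sublist, tagged with the sublist's position
    d = pd.DataFrame({"attr": row, "id": [i]*len(row)})
    result.append(d)
# Combine the per-sublist frames into one frame with a fresh index
result = pd.concat(result, ignore_index=True)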

Or the equivalent generator expression:

pd.concat((pd.DataFrame({"attr": row, "id": [i]*len(row)})
           for i, row in enumerate(x)), ignore_index=True)

Both work fine, producing a data frame like:

   id attr
0   0    a
1   0    b
2   0    c
3   1    A
4   1    B
5   2   AA
6   2   BB
7   2   CC

But it feels like there should be a more 'panda-esque' way of doing it than with a list-loop-append pattern or the equivalent generator.

Can I create the dataframe above with a pandas call, i.e. without the for loop or Python comprehension?

(Preferably the solution would also be faster: on the 'genres' column of the MovieLens data set at https://grouplens.org/datasets/movielens/, flattening the list of genres per movie this way takes more than 4 seconds, even though there are only 20k entries in total.)

Upvotes: 1

Views: 659

Answers (2)

A.Kot

Reputation: 7913

I believe stack() is what you are looking for:

pd.DataFrame(x).stack().reset_index().drop('level_1', axis=1)
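
For reference, here is a slightly longer sketch of the same stack() idea that also names the columns "id" and "attr" as in the question. The explicit dropna() is an assumption on my part: it is there so the padding added for shorter sublists is removed regardless of whether the installed pandas version drops missing values inside stack() itself.

```python
import pandas as pd

x = [["a", "b", "c"], ["A", "B"], ["AA", "BB", "CC"]]

# Shorter sublists are padded with None so that every row has the same width
wide = pd.DataFrame(x)

# stack() moves the columns into an inner index level;
# dropna() removes any padding that is still present afterwards
stacked = wide.stack().dropna()

result = (
    stacked.reset_index(level=1, drop=True)  # discard the position within each sublist
    .rename_axis("id")                       # the outer level is the sublist's position
    .rename("attr")                          # name the values column
    .reset_index()
)
print(result)
```

Note that this only works as intended if the sublists contain no genuine missing values, since those would be dropped along with the padding.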

Upvotes: 1

andrew

Reputation: 4089

It seems to me that what you need is a fast way to flatten the list x and, at the same time, build a second list holding the id of the sublist each element came from. There is a well-read post on efficiently flattening lists.

You can tweak the basic flattening list comprehension to generate the ids quickly as well.

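import pandas as pd

x = [["a", "b", "c"], ["A", "B"], ["AA", "BB", "CC"]]
# Flatten the nested list into a single list of values
attr = [a for sublist in x for a in sublist]
# Build [[0, 0, 0], [1, 1], [2, 2, 2]] and flatten it the same way
ids = [n for sublist in [[i]*len(r) for i, r in enumerate(x)] for n in sublist]
df = pd.DataFrame({'attr': attr, 'id': ids})
df
>>>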
  attr  id
0    a   0
1    b   0
2    c   0
3    A   1
4    B   1
5   AA   2
6   BB   2
7   CC   2
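
If you would rather not write the nested comprehensions by hand, the same two lists can be built with itertools.chain and numpy.repeat. This is only a sketch of an alternative, not something I have timed against the comprehension version, and the variable names (lengths, ids) are my own.

```python
from itertools import chain

import numpy as np
import pandas as pd

x = [["a", "b", "c"], ["A", "B"], ["AA", "BB", "CC"]]

# Flatten the sublists into one list of values
attr = list(chain.from_iterable(x))

# Repeat each sublist's position once per element it contains
lengths = [len(sublist) for sublist in x]
ids = np.repeat(np.arange(len(x)), lengths)

df = pd.DataFrame({"attr": attr, "id": ids})
print(df)
```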

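# Testing the time to flatten 20k nested lists
import timeit

setup = '''
vals = [[1], [1,2], [1,2,3], [1,2,3,4]]*5000
'''

# The code under test has to be passed as stmt; if it is left in setup,
# timeit only measures its default statement, `pass`.
stmt = '''
lots_of_ids = [attr for sublist in [[i]*len(r) for i, r in enumerate(vals)] for attr in sublist]
'''

# Best of 10 repeats of 100 runs each, reported in seconds per run
print(min(timeit.Timer(stmt=stmt, setup=setup).repeat(10, 100)) / 100)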

Upvotes: 0
