BlackHat

Reputation: 755

Faster way to flatten list in Pandas Dataframe

I have the dataframe below:

import pandas
df = pandas.DataFrame({"terms" : [[['the', 'boy', 'and', 'the goat'],['a', 'girl', 'and', 'the cat']], [['fish', 'boy', 'with', 'the dog'],['when', 'girl', 'find', 'the mouse'], ['if', 'dog', 'see', 'the cat']]]})

My desired outcome is as follows:

df2 = pandas.DataFrame({"terms" : ['the boy and the goat', 'a girl and the cat', 'fish boy with the dog', 'when girl find the mouse', 'if dog see the cat']})

Is there a simple way to accomplish this without using a for loop that iterates over every row and every sublist, as my current code does:

result = pandas.DataFrame()
for i in range(len(df.terms.tolist())):
    x = df.terms.tolist()[i]
    for y in x:
        z = str(y).replace(",",'').replace("'",'').replace('[','').replace(']','')
        flattened = pandas.DataFrame({'flattened_term':[z]})
        result = result.append(flattened)

print(result)

Thank you.

Upvotes: 2

Views: 209

Answers (1)

juanpa.arrivillaga

Reputation: 95957

There is certainly no way to avoid loops here, at least not implicit ones. Pandas is not designed to handle list objects as elements; it deals magnificently with numeric data, and pretty well with strings. In any case, your fundamental problem is that you are using pandas.DataFrame.append in a loop, which is a quadratic-time algorithm (the entire data frame is re-created on each iteration). You can probably just get away with the following, which should be significantly faster:

>>> df
                                               terms
0  [[the, boy, and, the goat], [a, girl, and, the...
1  [[fish, boy, with, the dog], [when, girl, find...
>>> pandas.DataFrame([' '.join(term) for row in df.itertuples() for term in row.terms])
                          0
0      the boy and the goat
1        a girl and the cat
2     fish boy with the dog
3  when girl find the mouse
4        if dog see the cat
>>>
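
If you would rather stay within pandas methods, something like the sketch below might also work. It assumes pandas 0.25 or later (where Series.explode is available) and that every cell holds a list of lists of strings, as in your example. The loops are still there, just hidden inside pandas, so I would not expect it to be faster than the list comprehension; the name flat is only illustrative.

```python
import pandas

# Same data as in the question.
df = pandas.DataFrame({"terms": [
    [["the", "boy", "and", "the goat"], ["a", "girl", "and", "the cat"]],
    [["fish", "boy", "with", "the dog"],
     ["when", "girl", "find", "the mouse"],
     ["if", "dog", "see", "the cat"]],
]})

# explode() unpacks one level of nesting: each inner list gets its own row.
# str.join(" ") then joins the words of each inner list with spaces.
flat = (
    df["terms"]
    .explode()
    .str.join(" ")
    .reset_index(drop=True)  # explode() repeats the original row labels
    .to_frame()              # back to a one-column DataFrame named "terms"
)
print(flat)
```

This keeps the column name terms, which matches the df2 you described.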

Upvotes: 3
