Outcast

Reputation: 5117

Split text in cells and create additional rows for the tokens

Let's suppose that I have the following in a DataFrame in pandas:

id  text
1   I am the first document and I am very happy.
2   Here is the second document and it likes playing tennis.
3   This is the third document and it looks very good today.

and I want to split the text of each id into chunks of 3 words, so that I finally have the following:

id  text
1   I am the
1   first document and
1   I am very
1   happy
2   Here is the
2   second document and
2   it likes playing
2   tennis
3   This is the
3   third document and
3   it looks very
3   good today

Keep in mind that my dataframe may also have other columns besides these two, which should simply be copied to the new dataframe in the same way as id above.

What is the most efficient way to do this?

I reckon that the solution to my question is quite close to the solution given here: Tokenise text and create more rows for each row in dataframe.

This may help too: Python: Split String every n word in smaller Strings.
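For clarity, here is a minimal, self-contained sketch of the transformation I'm after, using DataFrame.explode (available from pandas 0.25). It is only an illustration of the desired output, not necessarily the most efficient approach:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['I am the first document and I am very happy.',
             'Here is the second document and it likes playing tennis.',
             'This is the third document and it looks very good today.']
})

# Chunk each text into strings of n words, then explode into rows
n = 3
chunks = df['text'].apply(
    lambda s: [' '.join(s.split()[i:i + n])
               for i in range(0, len(s.split()), n)]
)
out = (df.drop(columns='text')
         .assign(text=chunks)
         .explode('text')
         .reset_index(drop=True))
print(out)
```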

Upvotes: 1

Views: 299

Answers (2)

Quang Hoang

Reputation: 150765

A self-contained solution, maybe a little slower:

# split every n words
n = 3

# in case id is not the index yet
df.set_index('id', inplace=True)

# one row per word, keeping id and the word's position (level_1)
new_df = df.text.str.split(' ', expand=True).stack().reset_index()

# group every n consecutive words per id and join them back together
new_df = (new_df.groupby(['id', new_df.level_1 // n])[0]
                .apply(' '.join)
                .reset_index(level=1, drop=True)
         )

new_df is a Series:

id
1               I am the
1     first document and
1              I am very
1                 happy.
2            Here is the
2    second document and
2       it likes playing
2                tennis.
3            This is the
3     third document and
3          it looks very
3            good today.
Name: 0, dtype: object
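If the dataframe has other columns to carry over, as the question mentions, one possible sketch is to join the chunked Series back on id. The extra column here is hypothetical, just to show the merge:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'extra': ['a', 'b', 'c'],   # hypothetical extra column to carry over
    'text': ['I am the first document and I am very happy.',
             'Here is the second document and it likes playing tennis.',
             'This is the third document and it looks very good today.']
})

n = 3
df = df.set_index('id')
new_df = df.text.str.split(' ', expand=True).stack().reset_index()
new_df = (new_df.groupby(['id', new_df.level_1 // n])[0]
                .apply(' '.join)
                .reset_index(level=1, drop=True))

# join the chunked Series back to the remaining columns on id
result = (new_df.rename('text')
                .reset_index()
                .merge(df.drop(columns='text').reset_index(), on='id'))
print(result)
```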

Upvotes: 1

anky

Reputation: 75080

You can use something like:

def divide_chunks(l, n):
    # yield successive n-sized chunks of l
    for i in range(0, len(l), n):
        yield l[i:i + n]
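For illustration, here is what the generator produces on the first document (the function is repeated so the snippet runs on its own):

```python
def divide_chunks(l, n):
    # yield successive n-sized chunks of l
    for i in range(0, len(l), n):
        yield l[i:i + n]

words = 'I am the first document and I am very happy.'.split()
chunks = list(divide_chunks(words, 3))
print(chunks)
# [['I', 'am', 'the'], ['first', 'document', 'and'], ['I', 'am', 'very'], ['happy.']]
```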

Then using unnesting:

df['text_new'] = df.text.apply(lambda x: list(divide_chunks(x.split(), 3)))
df_new = unnesting(df, ['text_new']).drop(columns='text')
df_new.text_new = df_new.text_new.apply(' '.join)
print(df_new)

              text_new  id
0             I am the   1
0   first document and   1
0            I am very   1
0               happy.   1
1          Here is the   2
1  second document and   2
1     it likes playing   2
1              tennis.   2
2          This is the   3
2   third document and   3
2        it looks very   3
2          good today.   3
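Note that unnesting is not defined in this answer; it refers to a helper from another Stack Overflow answer. A minimal stand-in with the behavior used here (my own sketch, not the original helper) could look like:

```python
from itertools import chain

import pandas as pd

def unnesting(df, explode):
    # one row per list element; the non-exploded columns are repeated
    # by reusing the original index (hence the 0, 0, 0, 0, 1, ... index above)
    idx = df.index.repeat(df[explode[0]].str.len())
    out = pd.DataFrame({c: list(chain.from_iterable(df[c])) for c in explode})
    out.index = idx
    return out.join(df.drop(columns=explode), how='left')

# quick check on a single row with two chunks
demo = pd.DataFrame({'id': [1], 'text_new': [[['I', 'am', 'the'], ['happy.']]]})
print(unnesting(demo, ['text_new']))
```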

EDIT:

m = (pd.DataFrame(df.text.apply(lambda x: list(divide_chunks(x.split(), 3))).values.tolist())
       .unstack().sort_index(level=1).apply(' '.join).reset_index(level=1))
m.columns = df.columns
print(m)

   id                 text
0   0             I am the
1   0   first document and
2   0            I am very
3   0               happy.
0   1          Here is the
1   1  second document and
2   1     it likes playing
3   1              tennis.
0   2          This is the
1   2   third document and
2   2        it looks very
3   2          good today.

Upvotes: 2
