Niken Amelia

Reputation: 37

Python: Split string every three words in a DataFrame

I've been searching around for a while now, but I can't seem to find the answer to this small problem.

I have this code that is supposed to split the string after every three words:

import pandas as pd
import numpy as np

df1 = {
    'State':['Arizona AZ asdf hello abc','Georgia GG asdfg hello def','Newyork NY asdfg hello ghi','Indiana IN asdfg hello jkl','Florida FL ASDFG hello mno']}

df1 = pd.DataFrame(df1,columns=['State'])
df1

def splitTextToTriplet(df):
    text = df['State'].str.split()
    n = 3
    grouped_words = [' '.join(str(text[i:i+n]) for i in range(0,len(text),n))]
    return grouped_words

splitTextToTriplet(df1)

Currently, the output is:

['0     [Arizona, AZ, asdf, hello, abc]\n1    [Georgia, GG, asdfg, hello, def]\nName: State, dtype: object 2    [Newyork, NY, asdfg, hello, ghi]\n3    [Indiana, IN, asdfg, hello, jkl]\nName: State, dtype: object 4    [Florida, FL, ASDFG, hello, mno]\nName: State, dtype: object']

But I actually expect this output, as five rows in a single DataFrame column:

['Arizona AZ asdf', 'hello abc']
['Georgia GG asdfg', 'hello def']
['Newyork NY asdfg', 'hello ghi']
['Indiana IN asdfg', 'hello jkl']
['Florida FL ASDFG', 'hello mno']

How can I change the code so that it produces the expected output?

Upvotes: 0

Views: 314

Answers (2)

BioGeek

Reputation: 22857

You can do:

def splitTextToTriplet(row):
    text = row['State'].split()
    n = 3
    grouped_words = [' '.join(text[i:i+n]) for i in range(0,len(text),n)]
    return grouped_words

df1.apply(lambda row: splitTextToTriplet(row), axis=1)

which gives the following Series of lists as output:

0     [Arizona AZ asdf, hello abc]
1    [Georgia GG asdfg, hello def]
2    [Newyork NY asdfg, hello ghi]
3    [Indiana IN asdfg, hello jkl]
4    [Florida FL ASDFG, hello mno]
dtype: object
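The key difference from the code in the question is that the split and the slicing now happen on one string at a time, rather than on the whole column (slicing a Series with text[i:i+n] selects rows, not words). As a minimal sketch of the same idea applied to the column directly, the helper name chunk_words and the new column name Triplets below are only illustrative and not part of the original code:

```python
import pandas as pd

def chunk_words(text, n=3):
    # Split one string into words, then rejoin them in groups of n.
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]

df1 = pd.DataFrame({'State': ['Arizona AZ asdf hello abc',
                              'Georgia GG asdfg hello def']})

# Series.apply calls chunk_words once per cell, so each call receives a plain str.
df1['Triplets'] = df1['State'].apply(chunk_words)
print(df1['Triplets'].tolist())
```

Storing the result in a new column keeps the original text alongside the grouped words.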

Upvotes: 1

mozway

Reputation: 260975

For efficiency, you can use a regex and str.extractall + groupby/agg:

(df1['State']
 .str.extractall(r'((?:\w+\b\s*){1,3})')[0]
 .groupby(level=0).agg(list)
)

output:

0     [Arizona AZ asdf , hello abc]
1    [Georgia GG asdfg , hello def]
2    [Newyork NY asdfg , hello ghi]
3    [Indiana IN asdfg , hello jkl]
4    [Florida FL ASDFG , hello mno]
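Note that the first element of each list keeps a trailing space, because the pattern also consumes the whitespace after each word. If that matters, one possible variant (a sketch of my own, not the pattern used above) matches a word followed by up to two more whitespace-separated words, so no trailing whitespace is captured:

```python
import pandas as pd

df1 = pd.DataFrame({'State': ['Arizona AZ asdf hello abc',
                              'Georgia GG asdfg hello def']})

# One word, then up to two further words, each preceded by whitespace.
# The whitespace after the last word of a group is never part of the match.
pattern = r'(\w+(?:\s+\w+){0,2})'

result = (df1['State']
          .str.extractall(pattern)[0]   # one row per match, column 0 is the capture group
          .groupby(level=0).agg(list)   # collect the matches of each original row
          )
print(result.tolist())
```

As with the original pattern, \w only covers letters, digits and underscores, so words containing punctuation would be split differently.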

regex:

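(             # start of the capturing group
(?:\w+\b\s*)  # non-capturing group: one word, then any trailing whitespace
{1,3}         # repeat that group one to three times (greedy)
)             # end of the capturing group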
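To check what the pattern matches without pandas, it can be tried on a single string with the standard re module. This is only a small illustration; the sample string is taken from the question:

```python
import re

pattern = r'((?:\w+\b\s*){1,3})'
sample = 'Arizona AZ asdf hello abc'

# With a single capturing group, re.findall returns the captured text of each match.
matches = re.findall(pattern, sample)
print(matches)  # ['Arizona AZ asdf ', 'hello abc']
```

The greedy {1,3} takes three words where it can and fewer at the end of the string, which is why the last group holds only two words.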

Upvotes: 1
