Reputation: 37
I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code that is supposed to split the string after every three words:
import pandas as pd
import numpy as np
df1 = {
'State':['Arizona AZ asdf hello abc','Georgia GG asdfg hello def','Newyork NY asdfg hello ghi','Indiana IN asdfg hello jkl','Florida FL ASDFG hello mno']}
df1 = pd.DataFrame(df1,columns=['State'])
df1
def splitTextToTriplet(df):
    text = df['State'].str.split()
    n = 3
    grouped_words = [' '.join(str(text[i:i+n]) for i in range(0,len(text),n))]
    return grouped_words
splitTextToTriplet(df1)
Currently the output is as such:
['0 [Arizona, AZ, asdf, hello, abc]\n1 [Georgia, GG, asdfg, hello, def]\nName: State, dtype: object 2 [Newyork, NY, asdfg, hello, ghi]\n3 [Indiana, IN, asdfg, hello, jkl]\nName: State, dtype: object 4 [Florida, FL, ASDFG, hello, mno]\nName: State, dtype: object']
But I actually expect this output, as a DataFrame with five rows and one column:
['Arizona AZ asdf', 'hello abc']
['Georgia GG asdfg', 'hello def']
['Newyork NY asdfg', 'hello ghi']
['Indiana IN asdfg', 'hello jkl']
['Florida FL ASDFG', 'hello mno']
How can I change the code so that it produces the expected output?
Upvotes: 0
Views: 314
Reputation: 22857
You can do:
def splitTextToTriplet(row):
    text = row['State'].split()
    n = 3
    grouped_words = [' '.join(text[i:i+n]) for i in range(0,len(text),n)]
    return grouped_words
df1.apply(lambda row: splitTextToTriplet(row), axis=1)
which gives the following output (a Series holding one list per row):
| | 0 |
|---|---|
| 0 | ['Arizona AZ asdf', 'hello abc'] |
| 1 | ['Georgia GG asdfg', 'hello def'] |
| 2 | ['Newyork NY asdfg', 'hello ghi'] |
| 3 | ['Indiana IN asdfg', 'hello jkl'] |
| 4 | ['Florida FL ASDFG', 'hello mno'] |
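The chunking step itself does not depend on pandas, so it can be checked on a plain string first. The following is a minimal sketch; the name `split_into_chunks` and the default `n=3` are illustrative choices, not part of the code above.

```python
def split_into_chunks(text, n=3):
    """Split text on whitespace and rejoin the words in groups of n."""
    words = text.split()
    # Step through the word list n items at a time; the last group may be shorter.
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]


print(split_into_chunks('Arizona AZ asdf hello abc'))
```

Because the helper takes a single string, it could also be passed straight to the column with `df1['State'].apply(split_into_chunks)`, which avoids the `lambda` and `axis=1`.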
Upvotes: 1
Reputation: 260975
For efficiency, you can use a regex and str.extractall + groupby/agg:
(df1['State']
   .str.extractall(r'((?:\w+\b\s*){1,3})')[0]
   .groupby(level=0).agg(list)
)
output:
0 [Arizona AZ asdf , hello abc]
1 [Georgia GG asdfg , hello def]
2 [Newyork NY asdfg , hello ghi]
3 [Indiana IN asdfg , hello jkl]
4 [Florida FL ASDFG , hello mno]
regex:
(              # start capturing
(?:\w+\b\s*)   # a word followed by optional whitespace (non-capturing group)
{1,3}          # repeated one to three times, greedily (so up to three words)
)              # end capturing
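Note that the output above keeps a trailing space on the first chunk of each row, because `\s*` is inside the capturing group. As a rough sketch, the same pattern can be tried with the standard `re` module on a single string, stripping each match afterwards; `regex_chunks` is a hypothetical helper name used only for this illustration.

```python
import re

# Same pattern as in the answer: one to three words, each followed by optional whitespace.
PATTERN = re.compile(r'((?:\w+\b\s*){1,3})')


def regex_chunks(text):
    # findall returns the single capturing group for every match;
    # strip() drops the whitespace that \s* pulled into the group.
    return [chunk.strip() for chunk in PATTERN.findall(text)]


print(PATTERN.findall('Arizona AZ asdf hello abc'))  # raw matches, trailing space kept
print(regex_chunks('Arizona AZ asdf hello abc'))     # stripped chunks
```

In the pandas version, chaining `.str.strip()` after the `[0]` selection should have the same effect before the `groupby`.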
Upvotes: 1