I am trying to apply a function to a column of a pandas DataFrame; the function returns a list of tuples. This is my function:
def myfunc(text):
    values = []
    sections = api_call(text)
    for (part1, part2, part3) in sections:
        value = (part1, part2, part3)
        values.append(value)
    return values
For example,
sections=myfunc("History: Had a fever\n Allergies: No")
print(sections)
output:
[('past_medical_history', 'History:', 'History: Had a fever\n '), ('allergies', 'Allergies:', 'Allergies: No')]
For each tuple, I would like to create a new row, with one column per tuple element. For example:
the original dataframe looks like this:
id  text
0   History: Had a fever\n Allergies: No
1   text2
and after applying the function, I want the dataframe to look like this (where xxx is various text content):
id  text             part1      part2       part3
0   History: Had...  past_...   History:    History: ...
0   Allergies: No    allergies  Allergies:  Allergies: No
1   text2            xxx        xxx         xxx
1   text2            xxx        xxx         xxx
1   text2            xxx        xxx         xxx
...
I could loop through the dataframe and build a new dataframe, but that would be really slow. I tried the following code but received a ValueError. Any suggestions?
df.apply(lambda x: pd.Series(myfunc(x['col']), index=['part1', 'part2', 'part3']), axis=1)
I did a little more research, and my question actually boils down to how to unnest a column that contains lists of tuples. The answer to "Split a list of tuples in a column of dataframe to columns of a dataframe" helped. Here is what I did:
# step1: sectionizing
df["sections"] =df["text"].apply(myfunc)
# step2: unnest the sections
part1s = []
part2s = []
part3s = []
ids = []
def create_lists(row):
    tuples = row['sections']
    id = row['id']
    for t in tuples:
        part1s.append(t[0])
        part2s.append(t[1])
        part3s.append(t[2])
        ids.append(id)
df.apply(create_lists, axis=1)
new_df = pd.DataFrame({"part1": part1s, "part2": part2s, "part3": part3s,
                       "id": ids})[["part1", "part2", "part3", "id"]]
But the performance is not very good. I wonder if there is a better way.
Upvotes: 4
Views: 5165
To convert the tuple column value to new columns, you can do the following:
df[['part1', 'part2', 'part3']] = pd.DataFrame(df['text'].tolist(), index=df.index)
print(df)
The output of this will be:
text part1 \
0 (past_medical_history, History:, History: Had ... past_medical_history
1 (allergies, Allergies:, Allergies: No) allergies
part2 part3
0 History: History: Had a fever\n
1 Allergies: Allergies: No
If the tuples in df['text'] vary in length (not always 3 items), you can concat as follows:
df = pd.concat([df[['text']],pd.DataFrame(df['text'].tolist()).add_prefix('part')],axis=1)
This gives the same result as before, except that the new columns are named part0, part1, part2.
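When each cell holds a list of tuples rather than a single tuple, as in the question, one option is to explode the list into rows first and then expand the tuples into columns. The following is a minimal sketch, not a benchmarked solution: the sample data (including the "other" row) is made up for illustration, the sections column is assumed to already hold the output of myfunc, and DataFrame.explode requires pandas 0.25 or later.

```python
import pandas as pd

# Illustrative data: "sections" stands in for df["text"].apply(myfunc).
df = pd.DataFrame({
    "id": [0, 1],
    "sections": [
        [("past_medical_history", "History:", "History: Had a fever\n "),
         ("allergies", "Allergies:", "Allergies: No")],
        [("other", "Other:", "Other: text2")],
    ],
})

# One row per tuple; explode repeats the original index, so reset it.
exploded = df.explode("sections").reset_index(drop=True)

# Expand each 3-tuple into its own set of columns.
parts = pd.DataFrame(exploded["sections"].tolist(),
                     columns=["part1", "part2", "part3"])

# Both frames now share the same 0..n-1 index, so concat lines them up.
new_df = pd.concat([exploded[["id"]], parts], axis=1)
print(new_df)
```

This avoids the row-wise df.apply and the manually maintained lists, and the id of each source row is carried along automatically.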
If the column holds comma-separated strings (as in the names column below), you don't need a function to do this. You already have a pd.Series, so all you have to do is split and expand.
df[['part1', 'part2', 'part3']] = df['names'].str.split(',',expand=True)
Output of this will be:
names part1 part2 part3
0 a,b,c a b c
1 e,f,g e f g
2 x,y,z x y z
If the values in the names column do not all split into the same number of parts and you want exactly 3 columns, pass n to split. n is the maximum number of splits, so n=2 gives at most 3 columns:
import pandas as pd
data = { 'names' : ['a,b,c','d,e,f','p,q,r,s','x,y,z']}
df = pd.DataFrame(data)
df = pd.concat([df[['names']],df['names'].str.split(',',n=2,expand=True).add_prefix('part')],axis=1)
print (df)
The output will be:
names part0 part1 part2
0 a,b,c a b c
1 d,e,f d e f
2 p,q,r,s p q r,s
3 x,y,z x y z
Or you can also do it as follows:
df[['part1', 'part2', 'part3']] = df['names'].str.split(',',n=2,expand=True)
This gives the same result, with different column names:
names part1 part2 part3
0 a,b,c a b c
1 d,e,f d e f
2 p,q,r,s p q r,s
3 x,y,z x y z
And in case you want to get all the values split into each column, then you can do this:
df = pd.concat([df[['names']],df['names'].str.split(',',expand=True).add_prefix('part').fillna('')],axis=1)
The output of this will be:
names part0 part1 part2 part3
0 a,b,c a b c
1 d,e,f d e f
2 p,q,r,s p q r s
3 x,y,z x y z
You can fill with np.nan instead if you want to store NaN values.
If the column uses several different delimiters, split on a regular expression that matches any of them:
import pandas as pd
data = { 'names' : ['a,b,c','d,e,f','p;q,r,s','x,y\nz,w']}
df = pd.DataFrame(data)
df = pd.concat([df[['names']],df['names'].str.split(',|\n|;',expand=True).add_prefix('part').fillna('')],axis=1)
print (df)
The output will be as follows:
names part0 part1 part2 part3
0 a,b,c a b c
1 d,e,f d e f
2 p;q,r,s p q r s
3 x,y\nz,w x y z w
Upvotes: 1
The idea here is to set up some data and a function that operates on it and returns three items. Splitting comma-separated values is a quick stand-in that mirrors the function you are after.
import pandas as pd
data = { 'names' : ['x,a,c','y,er,rt','z,1,ere']}
df = pd.DataFrame(data)
gives
names
0 x,a,c
1 y,er,rt
2 z,1,ere
now
def myfunc(text):
    sections = text.split(',')
    return sections
df[['part1', 'part2', 'part3']] = df['names'].apply(myfunc)
will give
names part1 part2 part3
0 x,a,c x y z
1 y,er,rt a er 1
2 z,1,ere c rt ere
Which is probably not what you want, however
df['part1'] ,df['part2'], df['part3'] = zip(*df['names'].apply(myfunc))
gives
names part1 part2 part3
0 x,a,c x a c
1 y,er,rt y er rt
2 z,1,ere z 1 ere
which is probably what you want.
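The reason zip(*...) helps is that it transposes the data: apply yields one list per row, while the three-way assignment expects one sequence per column. A small pandas-free sketch of just that step, using the same sample strings (the variable names here are illustrative):

```python
names = ["x,a,c", "y,er,rt", "z,1,ere"]

# One list per row, as myfunc would return for each value.
rows = [s.split(",") for s in names]

# zip(*rows) regroups the items by position, giving one tuple per column.
part1, part2, part3 = zip(*rows)

print(part1)  # first item of every row
print(part2)  # second item of every row
print(part3)  # third item of every row
```

Note that this unpacking only works when every row splits into exactly three items; zip silently truncates to the shortest row, so ragged rows would lose values.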
Upvotes: 1