GLP

Reputation: 3675

How to return multiple columns using apply in Pandas dataframe

I am trying to apply a function to a column of a Pandas dataframe. The function returns a list of tuples. This is my function:

def myfunc(text):
  values=[]
  sections=api_call(text)
  for (part1, part2, part3) in sections:
    value=(part1, part2, part3) 
    values.append(value)
  return values

For example,

sections=myfunc("History: Had a fever\n Allergies: No")
print(sections)

output:

[('past_medical_history', 'History:', 'History: Had a fever\n '), ('allergies', 'Allergies:', 'Allergies: No')]

For each tuple, I would like to create a new column. For example:

the original dataframe looks like this:

id text
0  History: Had a fever\n Allergies: No
1  text2

and after applying the function, I want the dataframe to look like this (where xxx is various text content):

id text            part1        part2        part3
0  History: Had... past_...     History:     History: ...
0  Allergies: No   allergies    Allergies:   Allergies: No
1  text2           xxx          xxx          xxx
1  text2           xxx          xxx          xxx
1  text2           xxx          xxx          xxx
...

I could loop through the dataframe and build a new dataframe, but that would be really slow. I tried the following code but received a ValueError. Any suggestions?

df.apply(lambda x: pd.Series(myfunc(x['col']), index=['part1', 'part2', 'part3']), axis=1)

After a bit more research, my question boils down to how to unnest a column that holds a list of tuples. The answer at "Split a list of tuples in a column of dataframe to columns of a dataframe" helped, and here is what I did:

# step1: sectionizing
df["sections"] =df["text"].apply(myfunc)

# step2: unnest the sections 
part1s = []
part2s = []
part3s = []
ids = []

def create_lists(row):
    tuples = row['sections']
    id = row['id']
    for t in tuples:
        part1s.append(t[0])
        part2s.append(t[1])
        part3s.append(t[2])
        ids.append(id)

df.apply(create_lists, axis=1)

new_df = pd.DataFrame({"part1" :part1s, "part2": part2s, "part3": part3s, 
                       "id": ids})[["part1", "part2", 'part3', "id"]]

But the performance is not very good. I wonder if there is a better way.

Upvotes: 4

Views: 5165

Answers (2)

Joe Ferndz

Reputation: 8508

Converting the tuple to new columns:

To convert the tuple column value to new columns, you can do the following:

df[['part1', 'part2', 'part3']] = pd.DataFrame(df['text'].tolist())
print (df)

The output of this will be:

                                                text                 part1  \
0  (past_medical_history, History:, History: Had ...  past_medical_history   
1             (allergies, Allergies:, Allergies: No)             allergies   

        part2                    part3  
0    History:  History: Had a fever\n   
1  Allergies:            Allergies: No  

If the number of items in the tuples in df['text'] varies (not always 3), you can concat as follows:

df = pd.concat([df[['text']],pd.DataFrame(df['text'].tolist()).add_prefix('part')],axis=1)

This gives the same result as before, except that the new columns are named part0, part1, part2.
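
For the case in the question, where each row holds a list of several tuples rather than a single tuple, one possible approach is DataFrame.explode (available from pandas 0.25). This is only a sketch: fake_sections is a hypothetical stand-in for the question's myfunc (which calls api_call), and the sample rows are made up.

```python
import pandas as pd

# Hypothetical stand-in for the question's myfunc(); the real one calls api_call().
def fake_sections(text):
    sections = []
    for chunk in text.split("\n"):
        header = chunk.split(":")[0].strip()
        sections.append((header.lower(), header + ":", chunk.strip()))
    return sections

df = pd.DataFrame({
    "id": [0, 1],
    "text": ["History: Had a fever\nAllergies: No", "Plan: Rest"],
})

# Step 1: one list of tuples per row.
df["sections"] = df["text"].apply(fake_sections)

# Step 2: explode() gives every tuple its own row. Rows whose list was empty
# become NaN, so they are dropped before the tuples are expanded.
exploded = (df.explode("sections")
              .dropna(subset=["sections"])
              .reset_index(drop=True))

# Step 3: expand each 3-tuple into three columns and attach them row by row.
parts = pd.DataFrame(exploded["sections"].tolist(),
                     columns=["part1", "part2", "part3"])
new_df = exploded[["id", "text"]].join(parts)
print(new_df)
```

This avoids the Python-level loop over rows and the four accumulator lists, at the cost of assuming every tuple has exactly three items.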

Converting comma separated values in a column to separate columns

You don't need to have a function to do this. You already have a pd.Series. All you have to do is split and expand.

df[['part1', 'part2', 'part3']] = df['names'].str.split(',',expand=True)

Output of this will be:

     names part1 part2 part3
0    a,b,c     a     b     c
1    e,f,g     e     f     g
2    x,y,z     x     y     z

If the rows in the names column hold different numbers of values and you want to split them into exactly 3 parts, you can pass n to split. n is the maximum number of splits, so the result has at most n + 1 columns; for 3 columns, use n=2:

import pandas as pd
data = { 'names' : ['a,b,c','d,e,f','p,q,r,s','x,y,z']}
df = pd.DataFrame(data)
df = pd.concat([df[['names']],df['names'].str.split(',',n=2,expand=True).add_prefix('part')],axis=1)
print (df)

The output will be:

     names part0 part1 part2
0    a,b,c     a     b     c
1    d,e,f     d     e     f
2  p,q,r,s     p     q   r,s
3    x,y,z     x     y     z

Or you can also do it as follows:

df[['part1', 'part2', 'part3']] = df['names'].str.split(',',n=2,expand=True)

This gives the same values, with the columns named part1 to part3:

     names part1 part2 part3
0    a,b,c     a     b     c
1    d,e,f     d     e     f
2  p,q,r,s     p     q   r,s
3    x,y,z     x     y     z

And in case you want to get all the values split into each column, then you can do this:

df = pd.concat([df[['names']],df['names'].str.split(',',expand=True).add_prefix('part').fillna('')],axis=1)

The output of this will be:

     names part0 part1 part2 part3
0    a,b,c     a     b     c      
1    d,e,f     d     e     f      
2  p,q,r,s     p     q     r     s
3    x,y,z     x     y     z      

If you would rather keep missing values than empty strings, fill with np.nan instead, or simply leave out fillna('').
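
As a sketch of that alternative, the snippet below leaves out fillna(''), so rows with fewer than four values keep a missing value in the last column, which isna() then detects. The sample data is the same as above.

```python
import pandas as pd

df = pd.DataFrame({'names': ['a,b,c', 'd,e,f', 'p,q,r,s', 'x,y,z']})

# Without fillna(''), rows with fewer than four values keep a missing value in part3.
parts = df['names'].str.split(',', expand=True).add_prefix('part')
df = pd.concat([df[['names']], parts], axis=1)

# Missing cells can be detected (and later filled or dropped) with isna().
print(df['part3'].isna())
```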

If you have multiple delimiters to split on, pass a regular expression that matches any of them:

import pandas as pd
data = { 'names' : ['a,b,c','d,e,f','p;q,r,s','x,y\nz,w']}
df = pd.DataFrame(data)
df = pd.concat([df[['names']],df['names'].str.split(',|\n|;',expand=True).add_prefix('part').fillna('')],axis=1)
print (df)

The output will be as follows:

      names part0 part1 part2 part3
0     a,b,c     a     b     c      
1     d,e,f     d     e     f      
2   p;q,r,s     p     q     r     s
3  x,y\nz,w     x     y     z     w
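
When every delimiter is a single character, a character class is an equivalent and slightly shorter pattern. A minimal sketch on the same sample data, relying on str.split treating a multi-character pattern as a regular expression by default:

```python
import pandas as pd

df = pd.DataFrame({'names': ['a,b,c', 'd,e,f', 'p;q,r,s', 'x,y\nz,w']})

# '[,;\n]' matches any one of the three delimiters: comma, semicolon, newline.
parts = df['names'].str.split('[,;\n]', expand=True).add_prefix('part').fillna('')
df = pd.concat([df[['names']], parts], axis=1)
print(df)
```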

Upvotes: 1

Paul Brennan

Reputation: 2696

The idea here is to set up some data and a function that operates on it and returns three items. Splitting comma-separated values is quick to set up and mirrors the function you are after.

import pandas as pd
data = { 'names' : ['x,a,c','y,er,rt','z,1,ere']}
df = pd.DataFrame(data)

gives

     names
0    x,a,c
1  y,er,rt
2  z,1,ere

now

def myfunc(text):
  sections=text.split(',')
  return sections

df[['part1', 'part2', 'part3']] = df['names'].apply(myfunc)

will give

    names   part1   part2   part3
0   x,a,c   x       y       z
1   y,er,rt a       er      1
2   z,1,ere c       rt      ere

This is probably not what you want, because the values end up transposed. However,

df['part1'] ,df['part2'], df['part3'] = zip(*df['names'].apply(myfunc))

gives

     names     part1 part2 part3
0    x,a,c     x     a     c
1  y,er,rt     y     er    rt
2  z,1,ere     z     1     ere

which is probably what you want.
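
Another way to get the same row-wise result, sketched here under the assumption that myfunc always returns exactly three items, is to build a DataFrame from the returned lists and join it back:

```python
import pandas as pd

df = pd.DataFrame({'names': ['x,a,c', 'y,er,rt', 'z,1,ere']})

def myfunc(text):
    return text.split(',')

# Each returned list becomes one row; index=df.index keeps the rows aligned with df.
parts = pd.DataFrame(df['names'].apply(myfunc).tolist(),
                     index=df.index,
                     columns=['part1', 'part2', 'part3'])
df = df.join(parts)
print(df)
```

Passing index=df.index matters when df does not have the default 0..n-1 index, since join aligns on index labels.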

Upvotes: 1
