gnotnek

Reputation: 319

Apply a custom function to an existing column to output multiple columns

Here is my starting df:

import numpy as np
import pandas as pd

df = pd.DataFrame(['alpha', 'beta'], columns = ['text'])
df
    text
0   alpha
1   beta

Here is the end result I want:

    text    first           second          third
0   alpha   alpha-first     alpha-second    alpha-third
1   beta    beta-first      beta-second     beta-third

I have written the custom function parse(), no issue there:

def parse(text):
    return [text + '-first', text + '-second', text + '-third']

Now I try to apply parse() to the initial df, which is where errors arise:

1) If I try the following:

df = df.reindex(columns = list(df.columns) + ['first', 'second', 'third']) # Create empty columns    
df[['first', 'second', 'third']] = df.text.apply(parse)

I get:

ValueError: Must have equal len keys and value when setting with an ndarray

2) Slightly different version:

df = df.reindex(columns = list(df.columns) + ['first', 'second', 'third']).astype(object) # Create empty columns of "object" type    
df[['first', 'second', 'third']] = df.text.apply(parse)

I get:

ValueError: shape mismatch: value array of shape (2,) could not be broadcast 
to indexing result of shape (3,2)

Where am I going wrong?

EDIT:

I should clarify that parse() itself is a much more complicated function in the real-world problem I'm trying to solve: it takes a paragraph, finds three specific types of strings in it, and returns those strings as a list of length 3. In the code above, I made up a simple definition of parse() as a substitute, to avoid getting bogged down in details unrelated to the two errors I'm getting.

Upvotes: 1

Views: 160

Answers (4)

thomas.mac

Reputation: 1256

Check this:

lst = ['text','first','second','third']
df = pd.DataFrame([['alpha']*len(lst),['beta']*len(lst)],columns=lst)

final = df.apply(lambda x: x+'-'+x.name)
final.text = final.text.str.split('-').str[0]  # keep the part before the hyphen in every row

Upvotes: 0

MaxU - stand with Ukraine

Reputation: 210832

This can be done in several ways:

Option 1:

def f(s):
    return pd.DataFrame(np.repeat(s, 3).values.reshape(len(s), -1),
                        columns=['first','second','third']) \
             .apply(lambda c: c+'-'+c.name)


In [183]: df[['first','second','third']] = f(df.text)

In [184]: df
Out[184]:
    text        first        second        third
0  alpha  alpha-first  alpha-second  alpha-third
1   beta   beta-first   beta-second   beta-third
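
To make the repeat-and-reshape step in Option 1 easier to follow, here is a rough step-by-step sketch of the same idea. The intermediate names (repeated, grid, wide) are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

s = pd.Series(['alpha', 'beta'])

# Repeat every value three times: alpha, alpha, alpha, beta, beta, beta
repeated = np.repeat(s.to_numpy(), 3)

# Fold the flat array into one row per original value (2 rows x 3 columns)
grid = repeated.reshape(len(s), -1)

# Label the columns, then append each column's own name to its values
wide = pd.DataFrame(grid, columns=['first', 'second', 'third'], index=s.index)
wide = wide.apply(lambda c: c + '-' + c.name)
```

Note that this only works because each new column is the text plus a fixed suffix; it does not call a custom function such as parse().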

Upvotes: 1

cmaher

Reputation: 5215

Here's a one-liner with pd.DataFrame.assign:

df.assign(**{x: df['text']+'-'+x for x in ['first', 'second', 'third']})

#     text        first        second        third
# 0  alpha  alpha-first  alpha-second  alpha-third
# 1   beta   beta-first   beta-second   beta-third

Upvotes: 1

jpp

Reputation: 164623

No need for apply:

import pandas as pd

df = pd.DataFrame(['alpha', 'beta'], columns = ['text'])

for i in ['first', 'second', 'third']:
    df[i] = df.text + '-' + i

#     text        first        second        third
# 0  alpha  alpha-first  alpha-second  alpha-third
# 1   beta   beta-first   beta-second   beta-third

In general, the order of preference for the type of processing to use in your calculations should be:

  1. Vectorised calculations, such as above.
  2. pd.Series.apply
  3. pd.DataFrame.apply
  4. pd.DataFrame.iterrows
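
If the real parse() cannot be vectorised, the second item in the list above might look like the sketch below. It assumes parse() always returns a list of exactly three strings; the names parsed and new_cols are made up for illustration. It also suggests why the assignments in the question fail: pd.Series.apply returns a single Series of lists (shape (2,)), not a 2 x 3 table, so it has to be expanded into columns first.

```python
import pandas as pd

def parse(text):
    # Stand-in for the real parser: returns a list of exactly three strings
    return [text + '-first', text + '-second', text + '-third']

df = pd.DataFrame(['alpha', 'beta'], columns=['text'])

# One list per row, held in a single Series of shape (2,)
parsed = df['text'].apply(parse)

# Expand the lists into three columns, aligned on the original index
new_cols = pd.DataFrame(parsed.tolist(), index=df.index,
                        columns=['first', 'second', 'third'])

# Attach the new columns to the original frame
df = df.join(new_cols)
```

Building new_cols with index=df.index keeps the rows aligned even when df does not have a default 0..n-1 index.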

Upvotes: 2
