Naveen Balasubramanian
Naveen Balasubramanian

Reputation: 671

Split a column data using another column data in Pandas Dataframe

I have a column A which is the subset of another column B in my dataframe. I want to split the data in the parent column B with A so I get the remaining of B in a new column C. I was trying the below piece using lambda. But the inner column is A which I mention as Series can't be converted into str.

df['C'] = df['B'].apply(lambda x: x.split(df['A'],1)[1])

Example DF:

B                       A
Ford F-Series pickup    Ford
Toyota Camry            Toyoto
Ford Taurus/Taurus X    Ford

Needed Output Data column:

C
F-Series pickup
Camry
Taurus/Taurus X

Here as the df['A'] is a series, we need it as a data from each row to do the same. Is there any possibility to pass the column A data as string. Any help would be appreciated. Thanks.

Upvotes: 3

Views: 3692

Answers (3)

seeiespi
seeiespi

Reputation: 3848

A more efficient way of doing this is by creating a list that loops through the .values of the columns you're seeking to perform string operations on.

In your example it would look like this:

df['C'] = [b.split(a)[0] for (b, a) in df[['B', 'A']].values]

To show that this is more efficient I ran the following test using one of my own datasets from google analytics:

Vector Solution
CPU times: user 30.8 ms, sys: 0 ns, total: 30.8 ms Wall time: 32.7 ms
appended_ga4_data['top_level'] = [y.split(x)[0] for (y, x) in appended_ga4_data[['Page path and screen class', 'subject']].values]

If we wanted to use the vector solution to get the second (or nth) element in the split, then we would need to include a conditional check that the list is long enough to get the index (assuming the split doesn't always apply):

Vector Solution with Conditions
CPU times: user 53.9 ms, sys: 635 µs, total: 54.5 ms Wall time: 66.4 ms
appended_ga4_data['top_level'] = [y.split(x)[1] if len(y.split(x))>1 else y.split(x)[0] for (y, x) in appended_ga4_data[['Page path and screen class', 'subject']].values]

Apply Solution
CPU times: user 245 ms, sys: 6.59 ms, total: 251 ms Wall time: 263 ms
appended_ga4_data['top_level'] = appended_ga4_data.apply(lambda x: x['Page path and screen class'].split(x['subject'])[0], axis=1)

Upvotes: 0

Craig
Craig

Reputation: 4855

I'm assuming that you aren't just trying to remove the model name, but also remove everything before the name.

The code below accomplishes what you want:

import pandas as pd

df = pd.DataFrame( {'A':['Ford', 'Toyota', 'Ford'], 
                    'B':['Ford F-Series pickup', 'Toyota Camry', 'Ford Taurus/Taurus X']} )

df['C'] = df.apply(lambda x: x['B'].split(x['A'], 1), axis=1)['B']
print(df)

Produces:

        A                     B                 C
0    Ford  Ford F-Series pickup   F-Series pickup
1  Toyota          Toyota Camry             Camry
2    Ford  Ford Taurus/Taurus X   Taurus/Taurus X

Upvotes: 3

Tristan
Tristan

Reputation: 1576

You can use the .replace method. Since you will create a leading whitespace after replacing the brand name with nothing, the .strip method will remove those.

import pandas as pd
df = pd.DataFrame({'A': ["Ford", "Audi"], 'B': ["Ford Pickup", "Audi R8"]})

df['C'] = df.apply(lambda x: x['B'].replace(x['A'], '').strip(), axis=1)

Upvotes: 2

Related Questions