Reputation: 671
I have a column A which is the subset of another column B in my dataframe. I want to split the data in the parent column B with A so I get the remaining of B in a new column C. I was trying the below piece using lambda. But the inner column is A which I mention as Series can't be converted into str.
df['C'] = df['B'].apply(lambda x: x.split(df['A'],1)[1])
Example DF:
B A
Ford F-Series pickup Ford
Toyota Camry Toyoto
Ford Taurus/Taurus X Ford
Needed Output Data column:
C
F-Series pickup
Camry
Taurus/Taurus X
Here as the df['A'] is a series, we need it as a data from each row to do the same. Is there any possibility to pass the column A data as string. Any help would be appreciated. Thanks.
Upvotes: 3
Views: 3692
Reputation: 3848
A more efficient way of doing this is by creating a list that loops through the .values
of the columns you're seeking to perform string operations on.
In your example it would look like this:
df['C'] = [b.split(a)[0] for (b, a) in df[['B', 'A']].values]
To show that this is more efficient I ran the following test using one of my own datasets from google analytics:
Vector Solution
CPU times: user 30.8 ms, sys: 0 ns, total: 30.8 ms
Wall time: 32.7 ms
appended_ga4_data['top_level'] = [y.split(x)[0] for (y, x) in appended_ga4_data[['Page path and screen class', 'subject']].values]
If we wanted to use the vector solution to get the second (or nth) element in the split, then we would need to include a conditional check that the list is long enough to get the index (assuming the split doesn't always apply):
Vector Solution with Conditions
CPU times: user 53.9 ms, sys: 635 µs, total: 54.5 ms
Wall time: 66.4 ms
appended_ga4_data['top_level'] = [y.split(x)[1] if len(y.split(x))>1 else y.split(x)[0] for (y, x) in appended_ga4_data[['Page path and screen class', 'subject']].values]
Apply Solution
CPU times: user 245 ms, sys: 6.59 ms, total: 251 ms
Wall time: 263 ms
appended_ga4_data['top_level'] = appended_ga4_data.apply(lambda x: x['Page path and screen class'].split(x['subject'])[0], axis=1)
Upvotes: 0
Reputation: 4855
I'm assuming that you aren't just trying to remove the model name, but also remove everything before the name.
The code below accomplishes what you want:
import pandas as pd
df = pd.DataFrame( {'A':['Ford', 'Toyota', 'Ford'],
'B':['Ford F-Series pickup', 'Toyota Camry', 'Ford Taurus/Taurus X']} )
df['C'] = df.apply(lambda x: x['B'].split(x['A'], 1), axis=1)['B']
print(df)
Produces:
A B C
0 Ford Ford F-Series pickup F-Series pickup
1 Toyota Toyota Camry Camry
2 Ford Ford Taurus/Taurus X Taurus/Taurus X
Upvotes: 3
Reputation: 1576
You can use the .replace
method. Since you will create a leading whitespace after replacing the brand name with nothing, the .strip
method will remove those.
import pandas as pd
df = pd.DataFrame({'A': ["Ford", "Audi"], 'B': ["Ford Pickup", "Audi R8"]})
df['C'] = df.apply(lambda x: x['B'].replace(x['A'], '').strip(), axis=1)
Upvotes: 2