Reputation: 25
Audience Ad
Audience1 Audience4.Ad1.image
Audience2 Audience1.Ad4.image
Audience3 Audience7.Ad1.image
Audience4 Audience2.Ad3.image
Audience5 Audience9.Ad1.image
Audience6 Audience4.Ad2.image
Audience7 Audience5.Ad1.image
Audience8 Audience7.Ad3.image
Audience9 Audience8.Ad1.image
Audience10 Audience9.Ad1.image
This is some sample data. What I am trying to do is to look in the Ad column and if it contains anything from the Audience column replace it with nothing. The most difficult part for me here is that on the left might say Audience1 and on the right Audience2 so that they are not the same. If they were I know how to implement this but unfortunately, they are not!
So the expected result will look like this
Audience Ad
Audience1 Ad1.image
Audience2 Ad4.image
Audience3 Ad1.image
Audience4 Ad3.image
Audience5 Ad1.image
Audience6 Ad2.image
Audience7 Ad1.image
Audience8 Ad3.image
Audience9 Ad1.image
Audience10 Ad1.image
The way I thought of doing this is by going through the Audience column with a for loop and then if I find that any of the elements of the Audience column is contained in the Ad column delete it.
This is how I tried to solve it but I am stuck on what to put in the return statement (if the rest of the logic of course is correct):
def replace(text):
for i in df['Audience']:
if i in text:
return ???
df['Ad'] = df['Ad'].apply(replace)
Any help would be appreciated!
Upvotes: 2
Views: 124
Reputation: 62403
Audience
to a set
to make sure there are no repeated values.str.split
the Ad
columnRemove terms from Ad
that are in the aud
list, with a list comprehension, and then str.join
the terms.
[y for y in x if y not in aud]
is a list comprehension
.split
. This iterates through each value and checks if it's in the aud
list. It it is, then it's not included in the new list.'.'.join()
creates a string from the elements of the listGiven a sample dataset of 10e6 rows (df = pd.concat([pd.DataFrame(data)]*1000000)
):
Wall time: 16.9 s
Wall time: 27.7 s
Wall time: 15.7 s
df[Audience]
, because the words are being joined into a string.import pandas as pd
# data and dataframe
data = {'Audience': ['Audience1', 'Audience2', 'Audience3', 'Audience4', 'Audience5', 'Audience6', 'Audience7', 'Audience8', 'Audience9', 'Audience10'],
'Ad': ['Audience4.Ad1.image', 'Audience1.Ad4.image', 'Audience7.Ad1.image', 'Audience2.Ad3.image', 'Audience9.Ad1.image', 'Audience4.Ad2.image', 'Audience5.Ad1.image', 'Audience7.Ad3.image', 'Audience8.Ad1.image', 'Audience9.Ad1.image']}
df = pd.DataFrame(data)
# create list of unique words from Audience
aud = set(df.Audience.str.lower())
# remove Audience words from Ad column
df.Ad = df.Ad.str.split('.').apply(lambda x: '.'.join([y for y in x if y.lower() not in aud]))
| | Audience | Ad |
|---:|:-----------|:----------|
| 0 | Audience1 | Ad1.image |
| 1 | Audience2 | Ad4.image |
| 2 | Audience3 | Ad1.image |
| 3 | Audience4 | Ad3.image |
| 4 | Audience5 | Ad1.image |
| 5 | Audience6 | Ad2.image |
| 6 | Audience7 | Ad1.image |
| 7 | Audience8 | Ad3.image |
| 8 | Audience9 | Ad1.image |
| 9 | Audience10 | Ad1.image |
data
data = {'Audience': ['Football.And.Basketball.Interests', 'Baseball.Interests', 'Cricket.Interests', 'Website.Visitors'],
'Ad': ['Baseball.Interests.Ad1.image', 'Football.And.Basketball.Interests.Ad4.image', 'Cricket.Interests.Ad1.image', 'Website.Visitors.Ad3.image']}
df = pd.DataFrame(data)
Audience Ad
Football.And.Basketball.Interests Baseball.Interests.Ad1.image
Baseball.Interests Football.And.Basketball.Interests.Ad4.image
Cricket.Interests Cricket.Interests.Ad1.image
Website.Visitors Website.Visitors.Ad3.image
# if Audience contains multiple values
aud = set(df.Audience.str.split('.').explode().str.lower())
# remove Audience words from Ad column
df.Ad = df.Ad.str.split('.').apply(lambda x: '.'.join([y for y in x if y.lower() not in aud]))
Audience Ad
Football.And.Basketball.Interests Ad1.image
Baseball.Interests Ad4.image
Cricket.Interests Ad1.image
Website.Visitors Ad3.image
Upvotes: 2
Reputation: 20669
You can use pd.Series.str.replace
with pd.Series.contains
mask = df['Ad'].str.contains('\.|'.join(set(df['Audience'])))
df.loc[mask,'Ad'] = df.loc[mask,'Ad'].str.replace(r'(Audience\d+.)','')
df
Audience Ad
0 Audience1 Ad1.image
1 Audience2 Ad4.image
2 Audience3 Ad1.image
3 Audience4 Ad3.image
4 Audience5 Ad1.image
5 Audience6 Ad2.image
6 Audience7 Ad1.image
7 Audience8 Ad3.image
8 Audience9 Ad1.image
9 Audience10 Ad1.image
Example with non-matches:
df
Audience Ad
0 Audience1 Audience4.Ad1.image
1 Audience2 Audience1.Ad4.image
2 Audience3 Audience7.Ad1.image
3 Audience4 Audience2.Ad3.image
4 Audience5 Audience9.Ad1.image
5 Audience6 Audience4.Ad2.image
6 Audience7 Audience5.Ad1.image
7 Audience8 Audience7.Ad3.image
8 Audience9 Audience8.Ad1.image
9 Audience10 Audience9.Ad1.image
10 Audience12 Audience11.Ad11.image
mask = df['Ad'].str.contains('\.|'.join(set(df['Audience'])))
df.loc[mask,'Ad'] = df.loc[mask,'Ad'].str.replace(r'(Audience\d+.)','')
df
Audience Ad
0 Audience1 Ad1.image
1 Audience2 Ad4.image
2 Audience3 Ad1.image
3 Audience4 Ad3.image
4 Audience5 Ad1.image
5 Audience6 Ad2.image
6 Audience7 Ad1.image
7 Audience8 Ad3.image
8 Audience9 Ad1.image
9 Audience10 Ad1.image
10 Audience12 Audience11.Ad11.image #---> Audience11 not deleted as 'Audience11' is not in `df['Audience']`
Upvotes: 2
Reputation: 71689
Use, Series.str
methods along with Series.isin
, Series.where
:
s = df['Ad'].str.split('.')
m = s.str[0].isin(df['Audience'])
df['Ad'] = s.where(~m, s.str[1:]).str.join('.')
# print(df)
Audience Ad
0 Audience1 Ad1.image
1 Audience2 Ad4.image
2 Audience3 Ad1.image
3 Audience4 Ad3.image
4 Audience5 Ad1.image
5 Audience6 Ad2.image
6 Audience7 Ad1.image
7 Audience8 Ad3.image
8 Audience9 Ad1.image
9 Audience10 Ad1.image
Upvotes: 1