Reputation: 10011
I want to replace the position words from strings
column: if they are either present sole or in multiple but join with ,
and space
.
id strings
0 1 south
1 2 north
2 3 east
3 4 west
4 5 west, east, south
5 6 west, west
6 7 north, north
7 8 north, south
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
My expected result will like this. Please note if they are components of phrase or words, then I don't need to replace them.
Is it possible to do that? Thank you.
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
The following code works, but I just wonder if there are some more concise methods?
df['strings'].astype(str).replace('south', np.nan).replace('north', np.nan)\
.replace('west', np.nan).replace('east', np.nan).replace('west, east', np.nan)\
.replace('west, west', np.nan).replace('north, north', np.nan).replace('west, east', np.nan)\
.replace('north, south', np.nan)
Upvotes: 1
Views: 674
Reputation: 82765
Using Regex.
Ex:
df = pd.DataFrame({'strings': ['south', 'north', 'east', 'west', 'west, east, south', 'west, west', 'north, north', 'north, south', 'West Corporation global office', 'West-Riding', 'University of West Florida', 'Southwest']})
df['R'] = df['strings'].replace(r"\b(south|north|east|west)\b,?", np.NAN, regex=True)
print(df)
Output:
strings R
0 south NaN
1 north NaN
2 east NaN
3 west NaN
4 west, east, south NaN
5 west, west NaN
6 north, north NaN
7 north, south NaN
8 West Corporation global office West Corporation global office
9 West-Riding West-Riding
10 University of West Florida University of West Florida
11 Southwest Southwest
Upvotes: 1
Reputation: 862611
First use Series.str.split
, forward filling for replace missing values, test if all matched values by DataFrame.isin
and DataFrame.all
for mask and last set missing values by Series.mask
:
L = ['south','north','east','west']
m = df['strings'].str.split(', ', expand=True).ffill(axis=1).isin(L).all(axis=1)
df['strings'] = df['strings'].mask(m)
print (df)
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
Another idea with set
s, isdisjoint
and Series.where
:
m = [set(x.split(', ')).isdisjoint(L) for x in df['strings']]
df['strings'] = df['strings'].where(m)
print (df)
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
Upvotes: 2