Reputation: 297
I want to extract only the full words of a string.
I have this df:
   Students                Age
0  Boston Terry Emma       23
1  Tommy Julien Cambridge  20
2  London                  21
3  New York Liu            30
4  Anna-Madrid+ Pauline    26
5  Mozart Cambridge        27
6  Gigi Tokyo Lily         18
7  Paris Diane Marie Dive  22
And I want to extract only FULL words from the string, NOT parts of a word. For example, iu should be extracted only if iu is written as a word on its own; it should not be extracted from Liu, because Liu is not iu.
cities = ['Boston', 'Cambridge', 'Bruxelles', 'New York', 'London', 'Amsterdam', 'Madrid', 'Tokyo', 'Paris']
liked_names = ['Emma', 'Pauline', 'Tommy Julien', 'iu']
Desired df:
   Students                Age  Cities     Liked Names
0  Boston Terry Emma       23   Boston     Emma
1  Tommy Julien Cambridge  20   Cambridge  Tommy Julien
2  London                  21   London     NaN
3  New York Liu            30   New York   NaN
4  Anna-Madrid+ Pauline    26   Madrid     Pauline
5  Mozart Cambridge        27   Cambridge  NaN
6  Gigi Tokyo Lily         18   Tokyo      NaN
7  Paris Diane Marie Dive  22   Paris      NaN
I tried this code:
pat = f'({"|".join(cities)})'
df['Cities'] = df['Students'].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df['Liked Names'] = df['Students'].str.extract(pat, expand=False)
My code for cities works; I just need to fix the issue with 'Liked Names'.
How can I make this work? Thanks a lot!
Upvotes: 0
Views: 101
Reputation: 2300
I think what you are looking for are word boundaries. In a regular expression they are written as \b. An ugly (albeit working) solution is to modify the liked_names list to include word boundaries and then run the code:
import pandas as pd

l = [
    ["Boston Terry Emma", 23],
    ["Tommy Julien Cambridge", 20],
    ["London", 21],
    ["New York Liu", 30],
    ["Anna-Madrid+ Pauline", 26],
    ["Mozart Cambridge", 27],
    ["Gigi Tokyo Lily", 18],
    ["Paris Diane Marie Dive", 22],
]
cities = [
    "Boston",
    "Cambridge",
    "Bruxelles",
    "New York",
    "London",
    "Amsterdam",
    "Madrid",
    "Tokyo",
    "Paris",
]
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
# here we modify the liked_names to include word boundaries.
liked_names = [r"\b" + n + r"\b" for n in liked_names]
df = pd.DataFrame(l, columns=["Students", "Age"])
pat = f'({"|".join(cities)})'
df["Cities"] = df["Students"].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df["Liked Names"] = df["Students"].str.extract(pat, expand=False)
print(df)
A nicer solution would be to include the word boundaries when the regular expression is built.
I first tried using \s, i.e. whitespace, but that did not work for a name at the end of a string, where no whitespace follows it, so \b was the solution. You can check https://regular-expressions.mobi/wordboundaries.html?wlr=1 for some details.
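As a rough sketch of that nicer variant, the boundaries can be placed once around the whole alternation instead of being added to every list entry. The helper name whole_word_pattern is made up for illustration, and re.escape is an extra precaution (an assumption, not part of the code above) in case a name contains regex metacharacters. The check uses only the standard re module; the same pattern string should also work with df["Students"].str.extract(pat, expand=False).

```python
import re

def whole_word_pattern(words):
    # Hypothetical helper: escape each word, then wrap the whole
    # alternation in word boundaries and a single capture group.
    alternation = "|".join(re.escape(w) for w in words)
    return rf"\b({alternation})\b"

liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
students = [
    "Boston Terry Emma",
    "Tommy Julien Cambridge",
    "London",
    "New York Liu",
    "Anna-Madrid+ Pauline",
    "Mozart Cambridge",
    "Gigi Tokyo Lily",
    "Paris Diane Marie Dive",
]

pat = whole_word_pattern(liked_names)
matches = []
for s in students:
    m = re.search(pat, s)
    # Keep the captured name, or None when no whole word matched.
    matches.append(m.group(1) if m else None)
print(matches)
```

With this pattern, iu no longer matches inside Liu, because there is no word boundary between L and i.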
Upvotes: 1
Reputation: 30070
You can do an additional check to see whether the matched name is in the Students column.
import numpy as np

def check(row):
    if row['Liked Names'] == row['Liked Names']:
        # If `Liked Names` is not nan
        # Get all possible names
        patterns = row['Students'].split(' ')
        # If matched `Liked Names` in `Students`
        isAllMatched = all([name in patterns for name in row['Liked Names'].split(' ')])
        if not isAllMatched:
            return np.nan
        else:
            return row['Liked Names']
    else:
        # If `Liked Names` is nan, still return nan
        return np.nan

df['Liked Names'] = df.apply(check, axis=1)
# print(df)
   Students                Age  Cities     Liked Names
0  Boston Terry Emma       23   Boston     Emma
1  Tommy Julien Cambridge  20   Cambridge  Tommy Julien
2  London                  21   London     NaN
3  New York Liu            30   New York   NaN
4  Anna-Madrid+ Pauline    26   Madrid     Pauline
5  Mozart Cambridge        27   Cambridge  NaN
6  Gigi Tokyo Lily         18   Tokyo      NaN
7  Paris Diane Marie Dive  22   Paris      NaN
Upvotes: 0
Reputation: 195623
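You can try this regex, which lets each name absorb any adjacent letters so that a match always covers a whole word (iu inside Liu is returned as Liu):
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
pat = (
    "(" + "|".join(r"[a-zA-Z]*{}[a-zA-Z]*".format(n) for n in liked_names) + ")"
)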
df["Liked Names"] = df["Students"].str.extract(pat)
print(df)
Prints:
   Students                Age  Liked Names
0  Boston Terry Emma       23   Emma
1  Tommy Julien Cambridge  20   Tommy Julien
2  London                  21   NaN
3  New York Liu            30   Liu
4  Anna-Madrid+ Pauline    26   Pauline
5  Mozart Cambridge        27   NaN
6  Gigi Tokyo Lily         18   NaN
7  Paris Diane Marie Dive  22   NaN
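To see how this pattern behaves on row 3, it can be compared with a word-boundary pattern using only the standard re module. This is an illustrative sketch; the variable names expand_pat and boundary_pat are chosen here, and the re.escape call is an added assumption.

```python
import re

liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
text = "New York Liu"

# Pattern from the answer above: letters next to a name are absorbed into the match.
expand_pat = "(" + "|".join(r"[a-zA-Z]*{}[a-zA-Z]*".format(n) for n in liked_names) + ")"
# Word-boundary pattern: a name only matches when it stands alone as a word.
boundary_pat = r"\b(" + "|".join(re.escape(n) for n in liked_names) + r")\b"

expanded = re.search(expand_pat, text)
bounded = re.search(boundary_pat, text)
print(expanded.group(1) if expanded else None)  # Liu
print(bounded)  # None
```

Which behaviour is wanted depends on whether a partial hit such as iu should be reported as the surrounding word or dropped entirely.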
Upvotes: 0