Elisa L.

Reputation: 297

How to only extract the full words of a string in Python?

I want to extract only the full words of a string.

I have this df:

                     Students  Age
0           Boston Terry Emma   23
1      Tommy Julien Cambridge   20
2                      London   21
3                New York Liu   30
4  Anna-Madrid+       Pauline   26
5         Mozart    Cambridge   27
6             Gigi Tokyo Lily   18
7      Paris Diane Marie Dive   22

I want to extract FULL words from the string, NOT parts of words. For example, iu should match only when iu appears as a separate word; it should not match inside Liu, because Liu is not iu.

cities = ['Boston', 'Cambridge', 'Bruxelles', 'New York', 'London', 'Amsterdam', 'Madrid', 'Tokyo', 'Paris']
liked_names = ['Emma', 'Pauline', 'Tommy Julien', 'iu']

Desired df:

                     Students  Age     Cities   Liked Names
0           Boston Terry Emma   23     Boston          Emma
1      Tommy Julien Cambridge   20  Cambridge  Tommy Julien
2                      London   21     London           NaN
3                New York Liu   30   New York           NaN
4  Anna-Madrid+       Pauline   26     Madrid       Pauline
5         Mozart    Cambridge   27  Cambridge           NaN
6             Gigi Tokyo Lily   18      Tokyo           NaN
7      Paris Diane Marie Dive   22      Paris           NaN

I tried this code:

pat = f'({"|".join(cities)})'
df['Cities'] = df['Students'].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df['Liked Names'] = df['Students'].str.extract(pat, expand=False)

My code works for cities; I just need to fix it for 'Liked Names', where row 3 currently returns iu (taken from inside Liu) instead of NaN.

How can I make this work? Thanks a lot!

Upvotes: 0

Views: 101

Answers (3)

divingTobi

Reputation: 2300

I think what you are looking for are word boundaries, which are written as \b in a regular expression. An ugly (albeit working) solution is to wrap each entry of liked_names in word boundaries and then run your code:

import pandas as pd

l = [
    ["Boston Terry Emma", 23],
    ["Tommy Julien Cambridge", 20],
    ["London", 21],
    ["New York Liu", 30],
    ["Anna-Madrid+       Pauline", 26],
    ["Mozart    Cambridge", 27],
    ["Gigi Tokyo Lily", 18],
    ["Paris Diane Marie Dive", 22],
]

cities = [
    "Boston",
    "Cambridge",
    "Bruxelles",
    "New York",
    "London",
    "Amsterdam",
    "Madrid",
    "Tokyo",
    "Paris",
]
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
# here we modify the liked_names to include word boundaries.
liked_names = [r"\b" + n + r"\b" for n in liked_names]
df = pd.DataFrame(l, columns=["Students", "Age"])

pat = f'({"|".join(cities)})'
df["Cities"] = df["Students"].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df["Liked Names"] = df["Students"].str.extract(pat, expand=False)

print(df)

A nicer solution would be to include the word boundaries in the creation of the regular expression.
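
One way to do that, as a minimal sketch: wrap the whole alternation in a single pair of \b anchors and escape each name with re.escape, so that names containing regex metacharacters are matched literally. The helper name build_pattern is hypothetical (it is not part of pandas), and the three-row frame is just a shortened copy of the data above.

```python
import re

import pandas as pd


def build_pattern(words):
    # Hypothetical helper: escape each word, join with "|", and wrap the
    # alternation in \b...\b so that only whole words can match.
    alternation = "|".join(re.escape(w) for w in words)
    return rf"\b({alternation})\b"


df = pd.DataFrame(
    {"Students": ["Boston Terry Emma", "New York Liu", "Tommy Julien Cambridge"]}
)
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]

# The single capture group is what str.extract returns.
df["Liked Names"] = df["Students"].str.extract(
    build_pattern(liked_names), expand=False
)
print(df)
```

With this pattern, "New York Liu" gives NaN, because there is no word boundary between the L and the iu.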

I first tried using \s, i.e. whitespace, but that did not work for a name at the end of the string, where no whitespace follows it, so \b was the solution. See https://regular-expressions.mobi/wordboundaries.html?wlr=1 for some details.
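
The difference between \s and \b can be seen in a small standalone check with the re module. This is only an illustration on two of the strings from the question; the variable names are made up for the example.

```python
import re

text = "Boston Terry Emma"

# \s needs an actual whitespace character after the name, so a name
# at the very end of the string is missed.
with_whitespace = re.search(r"\sEmma\s", text)

# \b is zero-width, so it also matches at the start or end of the string.
with_boundary = re.search(r"\bEmma\b", text)

# \b still rejects a match that sits inside a longer word.
inside_word = re.search(r"\biu\b", "New York Liu")

print(with_whitespace)        # None
print(with_boundary.group())  # Emma
print(inside_word)            # None
```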

Upvotes: 1

Ynjxsjmh

Reputation: 30070

After running your extraction code, you can do an additional check to see whether every word of the matched name appears as a whole word in the Students column.

import numpy as np

def check(row):
    liked = row['Liked Names']

    # NaN is the only value that is not equal to itself,
    # so this keeps NaN for rows where nothing was extracted
    if liked != liked:
        return np.nan

    # Split the original string into space-separated tokens
    tokens = row['Students'].split(' ')

    # Keep the match only if every word of it is a whole token in `Students`
    if all(name in tokens for name in liked.split(' ')):
        return liked
    return np.nan

df['Liked Names'] = df.apply(check, axis=1)
# print(df)

                     Students  Age     Cities   Liked Names
0           Boston Terry Emma   23     Boston          Emma
1      Tommy Julien Cambridge   20  Cambridge  Tommy Julien
2                      London   21     London           NaN
3                New York Liu   30   New York           NaN
4  Anna-Madrid+       Pauline   26     Madrid       Pauline
5         Mozart    Cambridge   27  Cambridge           NaN
6             Gigi Tokyo Lily   18      Tokyo           NaN
7      Paris Diane Marie Dive   22      Paris           NaN

Upvotes: 0

Andrej Kesely

Reputation: 195623

You can try this regex, which extends each match to the whole surrounding word (so iu returns Liu):

liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]

pat = (
    "(" + "|".join(r"[a-zA-Z]*{}[a-zA-Z]*".format(n) for n in liked_names) + ")"
)

df["Liked Names"] = df["Students"].str.extract(pat)
print(df)

Prints:

                     Students  Age   Liked Names
0           Boston Terry Emma   23          Emma
1      Tommy Julien Cambridge   20  Tommy Julien
2                      London   21           NaN
3                New York Liu   30           Liu
4  Anna-Madrid+       Pauline   26       Pauline
5         Mozart    Cambridge   27           NaN
6             Gigi Tokyo Lily   18           NaN
7      Paris Diane Marie Dive   22           NaN
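
As a quick check of what the [a-zA-Z]* wrapper does, here is a minimal sketch using plain re. The input "Emmanuel Smith" is a made-up example, not from the question. Because the wrapper extends the match to the surrounding letters, it also returns a longer word that merely contains a liked name.

```python
import re

liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]

# Same pattern as above: each name may be preceded and followed by letters.
pat = (
    "(" + "|".join(r"[a-zA-Z]*{}[a-zA-Z]*".format(n) for n in liked_names) + ")"
)

from_question = re.search(pat, "New York Liu").group()
made_up = re.search(pat, "Emmanuel Smith").group()  # hypothetical input

print(from_question)  # Liu
print(made_up)        # Emmanuel
```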

Upvotes: 0
