Benjamin Png

Reputation: 25

How to apply regex function to dataframe column to return value

I'm trying to apply a regex function to a column of a dataframe to determine gender pronouns. This is what my dataframe looks like:

    name                                            Descrip
0  Sarah           she doesn't like this because her mum...
1  David                 he does like it because his dad...
2    Sam  they generally don't like it because their par...

This is the code I ran to make that dataframe:

import pandas as pd

list_label = ["Sarah", "David", "Sam"]
list_descriptions = ["she doesn't like this because her mum...", "he does like it because his dad...", "they generally don't like it because their parent..."]

data3 = {'name':list_label, 'Descrip':list_descriptions}
test_df = pd.DataFrame(data3)

I'm trying to determine each person's gender by applying a regex function to the "Descrip" column. Specifically, these are the patterns I want to implement:

"male":"(he |his |him )",
"female":"(she |her |hers )",
"plural, or singular non-binary":"(they |them |their )"

The full code I've written is as follows:

This function attempts to match each pattern and returns the name of the gender pronoun group mentioned most often in a row's description. Each group has several keywords in its pattern string (e.g. him, her, they). The idea is to determine max_gender, the gender associated with the pattern group mentioned most often in a Descrip value. max_gender can therefore take one of three values: male | female | plural, or singular non-binary. If none of the patterns match the Descrip value, "unknown" is returned instead.

import re
def get_pronouns(text):
    patterns = {
        "male":"(he |his |him )",
        "female":"(she |her |hers )",
        "plural, or singular non-binary":"(they |them |their )"
    }
    max_gender = "unknown"
    max_gender_count = 0
    for gender in patterns:
        pattern = re.compile(gender)
        mentions = re.findall(pattern, text)
        count_mentions = len(mentions)
        if count_mentions > max_gender_count:
            max_gender_count = count_mentions
            max_gender = gender
    return max_gender

test_df["pronoun"] = test_df.loc[:, "Descrip"].apply(get_pronouns)
print(test_df)

However, when I run the code, it obviously fails to determine the gender pronoun. This is shown in the following output:

    name                                            Descrip  pronoun
0  Sarah           she doesn't like this because her mum...  unknown
1  David                 he does like it because his dad...  unknown
2    Sam  they generally don't like it because their par...  unknown

Does anyone know what is wrong with my code?

Upvotes: 2

Views: 319

Answers (1)

AMC

Reputation: 2702

If you want to discover why your code isn't working, add a print statement to your function like so:

    for gender in patterns:
        print(gender)
        pattern = re.compile(gender)
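
The printout shows the dictionary keys (male, female, ...) rather than the patterns: iterating over a dict yields its keys, so re.compile(gender) compiles the label itself. Here is a minimal sketch of the smallest fix, keeping your original patterns and looping over patterns.items() instead. The name get_pronouns_fixed is only for illustration:

```python
import re


def get_pronouns_fixed(text):
    patterns = {
        "male": "(he |his |him )",
        "female": "(she |her |hers )",
        "plural, or singular non-binary": "(they |them |their )",
    }
    max_gender = "unknown"
    max_gender_count = 0
    # .items() yields (label, pattern) pairs, so the regex is built
    # from the pattern string rather than from the label.
    for gender, pattern_str in patterns.items():
        count_mentions = len(re.findall(pattern_str, text))
        if count_mentions > max_gender_count:
            max_gender_count = count_mentions
            max_gender = gender
    return max_gender


print(get_pronouns_fixed("he does like it because his dad..."))        # male
print(get_pronouns_fixed("she doesn't like this because her mum..."))  # male
```

Note that the second call still comes out as male with the original patterns, because "he " matches inside "she " and "his " matches inside "this ". The loop is fixed, but the patterns are not.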

Your regex also needs some tweaks. For example, in the first line of the song "Breathe" by Pink Floyd, "Breathe, breathe in the air", your regex finds two male pronouns.

There may be other problems too, I'm not sure.


Here is a solution quite similar to yours. The regexes are fixed, the dictionary is replaced by a list of tuples, etc.


Solution code

import pandas as pd
import numpy as np
import re
import operator as op

names_list = ['Sarah', 'David', 'Sam']
descs_list = ["she doesn't like this because her mum...", 'he does like it because his dad...',
              "they generally don't like it because their parent..."]

df_1 = pd.DataFrame(data=zip(names_list, descs_list), columns=['Name', 'Desc'])

pronoun_re_list = [('male', re.compile(r"\b(?:he|his|him)\b", re.IGNORECASE)),
                   ('female', re.compile(r"\b(?:she|her|hers)\b", re.IGNORECASE)),
                   ('plural/nb', re.compile(r"\b(?:they|them|their)\b", re.IGNORECASE))]


def detect_pronouns(str_in: str) -> str:
    match_results = ((curr_pron, len(curr_patt.findall(str_in))) for curr_pron, curr_patt in pronoun_re_list)
    max_pron, max_counts = max(match_results, key=op.itemgetter(1))
    if max_counts == 0:
        return np.nan
    else:
        return max_pron


df_1['Pronouns'] = df_1['Desc'].map(detect_pronouns)

Explanations

Code

match_results is a generator expression. curr_pron stands for "current pronoun", and curr_patt for "current pattern". It might make things clearer if I rewrite it as a for loop which creates a list:

    match_results = []
    for curr_pron, curr_patt in pronoun_re_list:
        match_counts = len(curr_patt.findall(str_in))
        match_results.append((curr_pron, match_counts))

for curr_pron, curr_patt in ... is taking advantage of something which goes by a few different names, usually multiple assignment or tuple unpacking. You can find a nice article on it here. In this case, it's just a different way of writing:

    for curr_tuple in pronoun_re_list:
        curr_pron = curr_tuple[0]
        curr_patt = curr_tuple[1]
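
The max call is the one piece not spelled out above. As a rough sketch, with made-up counts standing in for what detect_pronouns would compute, key=op.itemgetter(1) tells max to compare the tuples by their second element (the count):

```python
import operator as op

# Hypothetical (pronoun, count) pairs, standing in for match_results.
match_results = [("male", 0), ("female", 2), ("plural/nb", 1)]

# itemgetter(1) picks index 1 of each tuple, so max compares the counts.
max_pron, max_counts = max(match_results, key=op.itemgetter(1))
print(max_pron, max_counts)  # female 2

# On a tie, max returns the first maximal item, so list order decides.
tied = [("male", 1), ("female", 1)]
tie_pron, tie_count = max(tied, key=op.itemgetter(1))
print(tie_pron)  # male
```

One consequence of the tie behaviour: a description that mentions two pronoun groups equally often is labelled with whichever group comes first in pronoun_re_list.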

RegEx

Time for everyone's favorite subject: regex! I use a wonderful website called RegEx101, where you can mess around with the patterns; it makes things so much easier to understand. I have set up a page with some test data and the regex I'll be covering below: https://regex101.com/r/Y1onRC/2.

Now, let's take a look at the regex I used: \b(?:he|his|him)\b.

The he|his|him part is exactly like in yours: it matches the words 'he', 'his' or 'him'. In your regex, that is surrounded by parentheses; mine also includes ?: after the opening parenthesis. (pattern stuff) is a capturing group, which, as the name implies, captures whatever it matches. Since here we don't actually care about the contents of the matches, only how many there are, we add ?: to create a non-capturing group, which doesn't capture (or save) the contents.
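
To make the difference concrete, here is a small sketch comparing the two kinds of group on an invented sentence. It only uses re.search and re.findall from the standard library:

```python
import re

text = "he saw him"  # made-up sample sentence

capturing = re.search(r"(he|his|him)", text)
non_capturing = re.search(r"(?:he|his|him)", text)

# Both find the same text, but only the capturing group saves it.
print(capturing.group(0), capturing.groups())          # he ('he',)
print(non_capturing.group(0), non_capturing.groups())  # he ()

# With a single capturing group, findall returns the group contents;
# with no groups, it returns the full matches. The count is the same.
print(re.findall(r"(he|his|him)", text))    # ['he', 'him']
print(re.findall(r"(?:he|his|him)", text))  # ['he', 'him']
```

For counting matches the two behave the same, so the non-capturing form is mostly a way of saying that the group is only there for the alternation.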

I said that the he|his|him part of the regex is the same as yours, but that isn't exactly true. You include a space after each pronoun, presumably to avoid it matching the letters he in the middle of a word. Unfortunately, as I mentioned above, it finds two matches in the sentence "Breathe, breathe in the air". Our saviour is \b, which matches word boundaries. This means we catch the he in "Words words words he.", whereas (he |his |him ) doesn't.
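
Both claims are easy to check directly. A quick sketch running the two patterns on the two sentences mentioned above (the variable names are only for illustration):

```python
import re

original = re.compile(r"(he |his |him )")
fixed = re.compile(r"\b(?:he|his|him)\b")

lyric = "Breathe, breathe in the air"
# "he " is found at the end of "breathe " and inside "the ".
print(original.findall(lyric))  # ['he ', 'he ']
print(fixed.findall(lyric))     # []

sentence = "Words words words he."
# The trailing "he" is followed by a full stop, not a space.
print(original.findall(sentence))  # []
print(fixed.findall(sentence))     # ['he']
```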

Finally, we compile the patterns with the re.IGNORECASE flag, which I don't think requires much explanation, although please do let me know if I'm wrong.

Here is how I would describe the two patterns in plain English:

  • (he |his |him ) matches the letters he followed by a space, his followed by a space, or him followed by a space, and returns the full match plus a group.
  • \b(?:he|his|him)\b with the re.IGNORECASE flag matches the words he, his, or him, regardless of case, and returns the full match.

Hope that was clear enough, let me know!


Result output

    Name    Desc                                                  Pronouns
--  ------  ----------------------------------------------------  ----------
 0  Sarah   she doesn't like this because her mum...              female
 1  David   he does like it because his dad...                    male
 2  Sam     they generally don't like it because their parent...  plural/nb

Let me know if you have any questions :)

Upvotes: 2
