Reputation: 25
I'm trying to apply a regex function to a column of a dataframe to determine gender pronouns. This is what my dataframe looks like:
name Descrip
0 Sarah she doesn't like this because her mum...
1 David he does like it because his dad...
2 Sam they generally don't like it because their par...
This is the code I ran to create that dataframe:
import pandas as pd
list_label = ["Sarah", "David", "Sam"]
list_descriptions = ["she doesn't like this because her mum...", "he does like it because his dad...", "they generally don't like it because their parent..."]
data3 = {'name':list_label, 'Descrip':list_descriptions}
test_df = pd.DataFrame(data3)
I'm trying to determine the gender of each person by applying a regex function to the "Descrip" column. Specifically, these are the patterns I want to implement:
"male":"(he |his |him )",
"female":"(she |her |hers )",
"plural, or singular non-binary":"(they |them |their )"
The function attempts to match each pattern and returns the name of the gender pronoun group mentioned most often in a row's description. Each group has several key words in its pattern string (e.g. him, her, they). The idea is to determine max_gender, the gender associated with the pattern group that matches most often in a given Descrip value. Thus, max_gender can take one of three values: male | female | plural, or singular non-binary. If none of the patterns match the Descrip value, "unknown" is returned instead.
The full code I've written is as follows:
import re

def get_pronouns(text):
    patterns = {
        "male":"(he |his |him )",
        "female":"(she |her |hers )",
        "plural, or singular non-binary":"(they |them |their )"
    }
    max_gender = "unknown"
    max_gender_count = 0
    for gender in patterns:
        pattern = re.compile(gender)
        mentions = re.findall(pattern, text)
        count_mentions = len(mentions)
        if count_mentions > max_gender_count:
            max_gender_count = count_mentions
            max_gender = gender
    return max_gender

test_df["pronoun"] = test_df.loc[:, "Descrip"].apply(get_pronouns)
print(test_df)
However, when I run the code, it obviously fails to determine the gender pronoun. This is shown in the following output:
name Descrip pronoun
0 Sarah she doesn't like this because her mum... unknown
1 David he does like it because his dad... unknown
2 Sam they generally don't like it because their par... unknown
Does anyone know what is wrong with my code?
Upvotes: 2
Views: 319
Reputation: 2702
If you want to discover why your code isn't working, add a print statement to your function like so:
for gender in patterns:
    print(gender)
    pattern = re.compile(gender)
Your regex also needs some tweaks. For example, in the first line of the song Breathe by Pink Floyd, "Breathe, breathe in the air", your male pattern finds two matches even though the line contains no pronouns.
There may be other problems too, I'm not sure.
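What the print statement reveals is that the loop iterates over the dictionary's keys, so re.compile(gender) compiles the label ("male", "female", ...) rather than the pattern stored under it. Below is a minimal sketch of that one fix, assuming everything else in the question's function stays as it is. The name get_pronouns_fixed is made up for this illustration, and the regex patterns are deliberately left unchanged.

```python
import re

def get_pronouns_fixed(text):
    # Same patterns as in the question, deliberately left unchanged.
    patterns = {
        "male": "(he |his |him )",
        "female": "(she |her |hers )",
        "plural, or singular non-binary": "(they |them |their )",
    }
    max_gender = "unknown"
    max_gender_count = 0
    # .items() yields (label, pattern) pairs, so the pattern is what gets searched for.
    for gender, pattern_text in patterns.items():
        count_mentions = len(re.findall(pattern_text, text))
        if count_mentions > max_gender_count:
            max_gender_count = count_mentions
            max_gender = gender
    return max_gender

descriptions = [
    "she doesn't like this because her mum...",
    "he does like it because his dad...",
    "they generally don't like it because their parent...",
]
results = [get_pronouns_fixed(d) for d in descriptions]
print(results)
```

With only this change, the first description is labelled "male" rather than "female": "he " matches inside "she " and "his " matches inside "this ", which ties with the two female matches, and the first group checked wins the tie. So the lookup bug is not the only problem; the patterns need work as well.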
Here is a solution quite similar to yours. The regexes are fixed, the dictionary is replaced by a list of tuples, and so on.
import pandas as pd
import numpy as np
import re
import operator as op

names_list = ['Sarah', 'David', 'Sam']
descs_list = ["she doesn't like this because her mum...", 'he does like it because his dad...',
              "they generally don't like it because their parent..."]
df_1 = pd.DataFrame(data=zip(names_list, descs_list), columns=['Name', 'Desc'])

# Each entry pairs a label with a compiled, case-insensitive, whole-word pattern.
pronoun_re_list = [('male', re.compile(r"\b(?:he|his|him)\b", re.IGNORECASE)),
                   ('female', re.compile(r"\b(?:she|her|hers)\b", re.IGNORECASE)),
                   ('plural/nb', re.compile(r"\b(?:they|them|their)\b", re.IGNORECASE))]

def detect_pronouns(str_in: str) -> str:
    # Count the matches for every pronoun group, then keep the group with the most.
    match_results = ((curr_pron, len(curr_patt.findall(str_in))) for curr_pron, curr_patt in pronoun_re_list)
    max_pron, max_counts = max(match_results, key=op.itemgetter(1))
    if max_counts == 0:
        return np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
    else:
        return max_pron

df_1['Pronouns'] = df_1['Desc'].map(detect_pronouns)
match_results is a generator expression. curr_pron stands for "current pronoun", and curr_patt for "current pattern". It might make things clearer if I rewrite it as a for loop which creates a list:
match_results = []
for curr_pron, curr_patt in pronoun_re_list:
    match_counts = len(curr_patt.findall(str_in))
    match_results.append((curr_pron, match_counts))
for curr_pron, curr_patt in ... is taking advantage of something which goes by a few different names, usually multiple assignment or tuple unpacking. You can find a nice article on it here. In this case, it's just a different way of writing:
for curr_tuple in pronoun_re_list:
    curr_pron = curr_tuple[0]
    curr_patt = curr_tuple[1]
Time for everyone's favorite subject: regex! I use a wonderful website called RegEx101, where you can mess around with the patterns; it makes things so much easier to understand. I have set up a page with some test data and the regex I'll be covering below: https://regex101.com/r/Y1onRC/2.
Now, let's take a look at the regex I used: \b(?:he|his|him)\b.
The he|his|him part is exactly like in yours; it matches the words 'he', 'his' or 'him'. In your regex, that is surrounded by parentheses; mine also includes ?: after the opening parenthesis. (pattern stuff) is a capturing group, which, as the name implies, captures whatever it matches. Since here we don't actually care about the contents of the matches, only whether there is or isn't a match, we add ?: to create a non-capturing group, which doesn't capture (or save) the contents.
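The difference between the two kinds of group can be seen in a small sketch like the following. The sample sentence and the variable names are made up for illustration; only standard re calls are used.

```python
import re

text = "he said his dog likes him"  # made-up sample sentence

capturing = re.compile(r"\b(he|his|him)\b")
non_capturing = re.compile(r"\b(?:he|his|him)\b")

first_capturing = capturing.search(text)
first_non_capturing = non_capturing.search(text)

# Both patterns find the same full match...
print(first_capturing.group(0), first_non_capturing.group(0))
# ...but only the capturing version saves the group's contents.
print(first_capturing.groups())      # one saved group
print(first_non_capturing.groups())  # no saved groups

# Counting matches works the same either way.
capturing_count = len(capturing.findall(text))
non_capturing_count = len(non_capturing.findall(text))
```

Since the answer only counts matches, either form would give the same totals; the non-capturing group simply avoids storing something that is never used.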
I said that the he|his|him part of the regex is the same as yours, but that isn't exactly true. You include a space after each pronoun, presumably to avoid matching the letters he in the middle of a word. Unfortunately, as I mentioned above, it still finds two matches in the sentence "Breathe, breathe in the air", because a trailing space does nothing to stop he from matching at the end of a longer word. Our saviour is \b, which matches word boundaries. It also means we catch the he in "Words words words he.", whereas (he |his |him ) doesn't, since that he is followed by a full stop rather than a space.
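To check the two sentences mentioned above, here is a short sketch comparing the pattern from the question with the word-boundary version. The variable names are only for illustration.

```python
import re

old_pattern = re.compile("(he |his |him )")          # pattern from the question
new_pattern = re.compile(r"\b(?:he|his|him)\b")      # word-boundary version

lyric = "Breathe, breathe in the air"
sentence = "Words words words he."

# The old pattern matches "he " at the end of "breathe " and "the ".
old_lyric = old_pattern.findall(lyric)
# With \b on both sides, "he" inside a longer word no longer counts.
new_lyric = new_pattern.findall(lyric)

# The old pattern needs a trailing space, so "he." is missed.
old_sentence = old_pattern.findall(sentence)
# \b also matches between "he" and the full stop.
new_sentence = new_pattern.findall(sentence)

print(old_lyric, new_lyric)        # ['he ', 'he '] []
print(old_sentence, new_sentence)  # [] ['he']
```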
Finally, we compile the patterns with the re.IGNORECASE flag, which I don't think requires much explanation, although please do let me know if I'm wrong.
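In case it helps, here is a tiny sketch of what the flag changes, using a made-up sentence that starts with a capitalised pronoun:

```python
import re

text = "He said it was his."  # made-up sample sentence

case_sensitive = re.compile(r"\b(?:he|his|him)\b")
case_insensitive = re.compile(r"\b(?:he|his|him)\b", re.IGNORECASE)

# Without the flag, the capitalised "He" at the start is missed.
sensitive_matches = case_sensitive.findall(text)
# With the flag, both pronouns are found.
insensitive_matches = case_insensitive.findall(text)

print(sensitive_matches)    # ['his']
print(insensitive_matches)  # ['He', 'his']
```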
Here is how I would describe the two patterns in plain English:
(he |his |him ) matches the letters he followed by a space, his followed by a space, or him followed by a space, and returns the full match plus a group.
\b(?:he|his|him)\b with the re.IGNORECASE flag matches the words he, his, or him, regardless of case, and returns the full match.
Hope that was clear enough, let me know!
Name Desc Pronouns
-- ------ ---------------------------------------------------- ----------
0 Sarah she doesn't like this because her mum... female
1 David he does like it because his dad... male
2 Sam they generally don't like it because their parent... plural/nb
Let me know if you have any questions :)
Upvotes: 2