Reputation: 9
I am trying create a python dictionary to reference 'WHM1',2,3, 'HISPM1',2,3, etc. and other iterations to create a new column with a specific string for ex. White or Hispanic. Using regex seems like the right path but I am missing something here and refuse to hard code the whole thing in the dictionary.
I have tried several iterations of regex and regexdict :
d = regexdict({'W*':'White', 'H*':'Hispanic'})
eeoc_nac2_All_unpivot_df['Race'] =
eeoc_nac2_All_unpivot_df['EEOC_Code'].map(d)
A new column will be created with 'White'
or 'Hispanic'
for each row based on what is in an existing column called 'EEOC_Code'
.
Upvotes: 0
Views: 76
Reputation: 189377
Your regular expressions are wrong - you appear to be using glob syntax instead of proper regular expressions.
In regex, x*
means "zero or more of x
" and so both your regexes will trivially match the empty string. You apparently mean
d = regexdict({'^W':'White', '^H':'Hispanic'})
instead, where the regex anchor ^
matches beginning of string.
There are several third-party packages 1, 2, 3 named regexdict
so you should probably point out which one you use. I can't tell whether the ^
is necessary here, or whether the regexes need to match the input completely (I have assumed a substring match is sufficient, as is usually the case in regex) because this sort of detail may well differ between implementations.
Upvotes: 1
Reputation: 2280
I'm not sure to have completely understood your problem. However, if all your labels have structure WHM... and HISP..., then you can simply check the first character:
for race in eeoc_nac2_All_unpivot_df['EEOC_Code']:
if race.startswith('W'):
eeoc_nac2_All_unpivot_df['Race'] = "White"
else:
eeoc_nac2_All_unpivot_df['Race'] = "Hispanic"
Note: it only works if what you have inside eeoc_nac2_All_unpivot_df['EEOC_Code']
is iterable.
Upvotes: 0