Reputation: 33
Hi, I am trying to remap a DataFrame using a dictionary in Python pandas, but I need to use regex to make it work properly.
Here is a sample of the dict:
di_cities = {
"Ain Salah (town)": "Ain Salah",
"Agadez town": "Agadez",
"Bamako city": "Bamako",
"Birnin Konni town": "Birni N Konni",
"Konni": "Birni N Konni",
"Kadunà": "Kaduna",
"Kaduna (city)": "Kaduna",
"Kano (city)": "Kano",
"Matamey": "Matamey",
"Mopti city": "Mopti",
"N'guigmi": "Nguigmi",
"Tunis": "Tunis",
"Tunis (city)": "Tunis"
}
I am using this iteration:
di_cities = {rf"\b{k}\b": v for k, v in di_cities.items()}
df_cities_clean = df.replace(di_cities, regex=True)
As you can see in the pic (final result), it works fine for Bamako, Agadez, Mopti and every single-word string. It doesn't work for any string with parentheses, and in the case of Birnin Konni it messes things up a little.
I am using another dictionary in a similar way, but there every string is between parentheses, and {rf"\({k}\)" works perfectly.
Can you help me?
Upvotes: 3
Views: 349
Reputation: 626689
I suggest using
import re
di_cities = {rf"\b{re.escape(k)}(?:(?<=\w)\b|(?<!\w))": v for k, v in di_cities.items()}
With this dictionary comprehension, you create another dictionary whose keys are regular expressions that match the former keys as whole words. Each match must start at a word boundary, so the key is expected to begin with a word character (a letter, digit or underscore), and, if the key ends with a word char, the match must not be immediately followed by another word char. If a key does not end with a word char, say, if it ends with punctuation or whitespace (maybe adding .strip() would make it safer), no additional boundary check is applied.
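To make the difference concrete, here is a minimal sketch that uses only the standard re module rather than pandas. The helper name whole_word_pattern and the sample strings are assumptions made for illustration; the pattern itself is the one suggested above.

```python
import re

def whole_word_pattern(key: str) -> str:
    # Escape the key, require a word boundary before it, and apply a
    # trailing word boundary only when the key ends with a word character.
    return rf"\b{re.escape(key)}(?:(?<=\w)\b|(?<!\w))"

# Pattern built the way the question does it: the parentheses are not
# escaped, so they form a capturing group instead of matching literally.
naive = rf"\b{'Kano (city)'}\b"
naive_match = re.search(naive, "Kano (city)")

# Key ending with a non-word character: no trailing boundary is required.
kano = re.sub(whole_word_pattern("Kano (city)"), "Kano", "Kano (city)")

# Key ending with a word character: the trailing boundary still applies,
# so a longer word such as "township" is left alone.
agadez = re.sub(whole_word_pattern("Agadez town"), "Agadez", "Agadez town")
township = re.sub(whole_word_pattern("Agadez town"), "Agadez", "Agadez township")

print(naive_match, kano, agadez, township, sep=" | ")
```

The naive pattern finds nothing in "Kano (city)", while the escaped pattern with the adaptive trailing check replaces the whole string and still refuses to match inside "Agadez township".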
The rf"\b{re.escape(k)}(?:(?<=\w)\b|(?<!\w))" escapes all special regex metacharacters in the key first, and then prepends it with a word boundary, and (?:(?<=\w)\b|(?<!\w)) is a non-capturing group that matches
(?<=\w)\b - a word boundary if the preceding char is a word char ((?<=...) is a positive lookbehind)
| - or
(?<!\w) - no additional check (empty string is matched) if there is no word char immediately to the left of the current location ((?<!...) is a negative lookbehind).
Upvotes: 1