Reputation: 59
I'm trying to search a dataframe for certain words listed in a dictionary's values; if any of them is found, it should be replaced with the corresponding key.
units_dic= {'grams':['g','Grams'],
'kg' :['kilogram','kilograms']}
The problem is that some unit abbreviations are single letters, so every occurrence of that letter gets replaced as well. I want the replacement to happen only when the word is preceded by a number, to make sure it's a unit.
Dataframe
Id | test
---------
1 |'A small paperclip has a mass of about 111 g'
2 |'1 kilogram =1000 g'
3 |'g is the 7th letter in the ISO basic Latin alphabet'
Replacement Loop
x = df.copy()
for k in units_dic:
    for i in range(len(x['test'])):
        for w in units_dic[k]:
            x['test'][i] = str(x['test'][i]).replace(str(w), str(k))
The Output
Id | test
---------
1 |'A small paperclip has a mass of about 111 grams'
2 |'1 kg =1000 grams'
3 |'grams is the 7th letter in the ISO basic Latin alphabet'
Upvotes: 2
Views: 149
Reputation: 42906
We can make use of the lookbehind feature of regex here, which lets us specify that the match must be preceded by a number, optionally followed by whitespace:
for k, v in units_dic.items():
    df['test'] = df['test'].str.replace(fr"(?<=[0-9])\s*({'|'.join(v)})\b", f' {k}', regex=True)
print(df)
Id test
0 1 'A small paperclip has a mass of about 111 grams'
1 2 '1 kg =1000 grams'
2 3 'g is the 7th letter in the ISO basic Latin al...
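To try the lookbehind idea without a dataframe, here is a minimal sketch that applies the same pattern to plain strings with the standard `re` module. The helper name `replace_units` and the use of `re.escape` are assumptions added for illustration; they are not part of the answer above.

```python
import re

units_dic = {'grams': ['g', 'Grams'], 'kg': ['kilogram', 'kilograms']}

texts = [
    'A small paperclip has a mass of about 111 g',
    '1 kilogram =1000 g',
    'g is the 7th letter in the ISO basic Latin alphabet',
]

def replace_units(text):
    """Hypothetical helper: apply one lookbehind pattern per target unit."""
    for unit, aliases in units_dic.items():
        # re.escape guards against aliases that contain regex metacharacters.
        alternation = '|'.join(re.escape(a) for a in aliases)
        # Raw f-string, so \s and \b reach the regex engine unchanged.
        pattern = fr'(?<=[0-9])\s*(?:{alternation})\b'
        text = re.sub(pattern, f' {unit}', text)
    return text

converted = [replace_units(t) for t in texts]
for line in converted:
    print(line)
```

If this behaves as expected, the same function could probably be applied to the column with `df['test'].map(replace_units)`.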
Explanation
First we use a raw f-string: fr'sometext'
Regular expression:
(?<=[0-9]) = preceded by a number
\s* = zero or more whitespace characters
"|".join(v) gives us the values in your dictionary back delimited by a |, which is the "or" operator in regex
\b = word boundary, so the unit has to end there
Upvotes: 0
Reputation: 51155
Regular expressions to the rescue along with flipping the dictionary.
import re
d = {i: k for k, v in units_dic.items() for i in v}
u = r'|'.join(d)
v = fr'(\d+\s?)\b({u})\b'
df.assign(test=[re.sub(v, lambda x: x.group(1) + d[x.group(2)], el) for el in df.test])
Id test
0 1 A small paperclip has a mass of about 111 grams
1 2 1 kg =1000 grams
2 3 g is the 7th letter in the ISO basic Latin alp...
Upvotes: 1
Reputation: 2032
Try:
for key, val in units_dic.items():
    # Group the alternatives and keep the number via the capture group.
    df['test'] = df['test'].replace(r"(\d+[ ]*)(?:" + "|".join(val) + r")\b", r"\g<1>" + key, regex=True)
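One thing to watch when building a pattern by joining alternatives: `|` has the lowest precedence in a regex, so a digit prefix only applies to the first alternative unless the alternatives are wrapped in a group. The sketch below compares the two forms on a made-up sample string, using plain `re` rather than pandas.

```python
import re

aliases = ['g', 'Grams']

# Ungrouped: parsed as (\d+[ ]*g) OR (Grams), so 'Grams' matches without a digit.
ungrouped = r'\d+[ ]*' + '|'.join(aliases)
# Grouped: the digit prefix applies to every alias; \b requires the unit to end there.
grouped = r'(\d+[ ]*)(?:' + '|'.join(aliases) + r')\b'

sample = 'Grams are a unit: 5 g'  # made-up sample text

# The ungrouped pattern also swallows the number, because it is part of the match.
ungrouped_result = re.sub(ungrouped, 'grams', sample)
# \g<1> puts the captured number (and any spaces) back in front of the unit.
grouped_result = re.sub(grouped, r'\g<1>grams', sample)

print(ungrouped_result)
print(grouped_result)
```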
Upvotes: 1