Reputation: 8844
I couldn't find a solution on Stack Overflow for replacing text based on a dictionary whose values are lists.
Dictionary
dct = {"LOL": ["laught out loud", "laught-out loud"],
"TLDR": ["too long didn't read", "too long; did not read"],
"application": ["app"]}
Input
input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
("laught-out loud so I couldnt too long; did not read"),
("what happened?")], columns=['text'])
Expected output
output_df = pd.DataFrame([("haha TLDR and LOL :D"),
("LOL so I couldnt TLDR"),
("what happened?")], columns=['text'])
Edit
Added an additional entry to the dictionary, i.e. "application": ["app"].
With it, the current solutions turn the third row into "what happlicationened?".
Please suggest a fix.
Upvotes: 3
Views: 525
Reputation: 402483
Build an inverted mapping and use Series.replace with regex=True.
mapping = {v : k for k, V in dct.items() for v in V}
input_df['text'] = input_df['text'].replace(mapping, regex=True)
print(input_df)
text
0 haha TLDR and LOL :D
1 LOL so I couldnt TLDR
Where,
print(mapping)
{'laught out loud': 'LOL',
'laught-out loud': 'LOL',
"too long didn't read": 'TLDR',
'too long; did not read': 'TLDR'}
To match full words, add word boundaries to each word:
mapping = {rf'\b{v}\b' : k for k, V in dct.items() for v in V}
input_df['text'] = input_df['text'].replace(mapping, regex=True)
print(input_df)
text
0 haha TLDR and LOL :D
1 LOL so I couldnt TLDR
2 what happened?
Where,
print(mapping)
{'\\bapp\\b': 'application',
'\\blaught out loud\\b': 'LOL',
'\\blaught-out loud\\b': 'LOL',
"\\btoo long didn't read\\b": 'TLDR',
'\\btoo long; did not read\\b': 'TLDR'}
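To see why the boundaries matter, here is a minimal sketch using only the standard `re` module (the sample sentence "open the app now" is made up for illustration). It reproduces the "happlicationened" result from the question and shows that `\b` restricts the match to a standalone word.

```python
import re

text = "what happened?"

# Without boundaries, "app" also matches inside "happened".
unbounded = re.sub("app", "application", text)

# With \b on both sides, "app" only matches as a standalone word.
bounded = re.sub(r"\bapp\b", "application", text)
standalone = re.sub(r"\bapp\b", "application", "open the app now")

print(unbounded)   # what happlicationened?
print(bounded)     # what happened?
print(standalone)  # open the application now
```

Note that `\b` only matches between a word character and a non-word character, so this assumes every phrase starts and ends with a letter, digit, or underscore. If the phrases may contain regex metacharacters, wrapping them in `re.escape` before adding the boundaries is a sensible precaution.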
Upvotes: 6
Reputation: 2939
I think the most logical place to start is to reverse your dictionary, so that each original string is a key that maps to its replacement. You can do that by hand, or in any number of other ways, such as:
import itertools
dict_rev = dict(itertools.chain.from_iterable([list(zip(v, [k]*len(v))) for k, v in dct.items()]))
That isn't very readable. This version, borrowed from another answer, reads better:
dict_rev = {v : k for k, V in dct.items() for v in V}
This requires every value in your dictionary to be a list (or another iterable of strings), e.g. "new key": ["single_val"]; otherwise a bare string is iterated character by character and each character becomes its own key.
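A small sketch of that pitfall, plus one possible guard. The helper name `invert_mapping` is hypothetical and not part of any library, and the bare-string entry is added here only to trigger the problem.

```python
dct = {
    "LOL": ["laught out loud", "laught-out loud"],
    "application": "app",  # bare string instead of a list (deliberate mistake)
}

# The plain comprehension iterates over the characters of "app".
exploded = {v: k for k, V in dct.items() for v in V}


def invert_mapping(dct):
    """Invert {new: [old, ...]} into {old: new}, tolerating bare strings."""
    inverted = {}
    for key, values in dct.items():
        if isinstance(values, str):
            values = [values]  # wrap a bare string so it is not split up
        for value in values:
            inverted[value] = key
    return inverted


safe = invert_mapping(dct)
print(exploded)
print(safe)
```

With the guard in place, "app" stays a single key, while the plain comprehension produces the single-character keys "a" and "p".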
You can then do the following (based on the code in "How to replace multiple substrings of a string?"):
import re
rep = dict((re.escape(k), v) for k, v in dict_rev.items())
pattern = re.compile("|".join(rep.keys()))
# regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching
input_df["text"] = input_df["text"].str.replace(pattern, lambda m: rep[re.escape(m.group(0))], regex=True)
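The single-pass idea can also be sketched without pandas. This is only one possible variant, and it makes two choices that are my assumptions rather than part of the code above: the alternation is wrapped in `\b...\b` so "app" does not match inside "happened", and the phrases are sorted longest first so a longer phrase wins over a shorter one sharing its prefix. The function name `replace_phrases` is hypothetical.

```python
import re

dct = {
    "LOL": ["laught out loud", "laught-out loud"],
    "TLDR": ["too long didn't read", "too long; did not read"],
    "application": ["app"],
}
dict_rev = {v: k for k, V in dct.items() for v in V}

# Longest phrases first, each escaped, all inside one bounded alternation.
phrases = sorted(dict_rev, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b")


def replace_phrases(text):
    # m.group(0) is the literal matched text, so it can index dict_rev directly.
    return pattern.sub(lambda m: dict_rev[m.group(0)], text)


texts = [
    "haha too long didn't read and laught out loud :D",
    "laught-out loud so I couldnt too long; did not read",
    "what happened?",
]
result = [replace_phrases(t) for t in texts]
print(result)
```

On a DataFrame, the same function could be applied with `input_df["text"].apply(replace_phrases)`.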
This method runs roughly 3 times faster than the simpler, more elegant solution:
Simple:
%timeit input_df["text"].replace(dict_rev, regex=True)
425 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Faster:
%timeit input_df["text"].str.replace(pattern, lambda m: rep[re.escape(m.group(0))])
160 µs ± 7.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Upvotes: 1
Reputation: 82765
Using Series.apply and a custom function. Ex:
import pandas as pd
def custReplace(value):
    dct = {"LOL": ["laught out loud", "laught-out loud"],
           "TLDR": ["too long didn't read", "too long; did not read"]
           }
    for k, v in dct.items():
        for i in v:
            if i in value:
                value = value.replace(i, k)
    return value
input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
("laught-out loud so I couldnt too long; did not read")], columns=['text'])
print(input_df["text"].apply(custReplace))
Output:
0 haha TLDR and LOL :D
1 LOL so I couldnt TLDR
Name: text, dtype: object
or
dct = {"LOL": ["laught out loud", "laught-out loud"],
"TLDR": ["too long didn't read", "too long; did not read"]
}
dct = { "(" + "|".join(v) + ")": k for k, v in dct.items()}
input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
("laught-out loud so I couldnt too long; did not read")], columns=['text'])
print(input_df["text"].replace(dct, regex=True))
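The grouped-pattern dictionary can be hardened a little so it also handles the "application": ["app"] entry from the question's edit. The sketch below is an assumption-laden variant, not the answer's exact code: it escapes each phrase, uses a non-capturing group, and adds word boundaries. It uses plain `re.sub` in a loop (the helper `apply_grouped` is hypothetical) so it runs without pandas.

```python
import re

dct = {
    "LOL": ["laught out loud", "laught-out loud"],
    "TLDR": ["too long didn't read", "too long; did not read"],
    "application": ["app"],
}

# One bounded, escaped alternation per replacement key.
grouped = {
    r"\b(?:" + "|".join(re.escape(v) for v in values) + r")\b": key
    for key, values in dct.items()
}


def apply_grouped(text):
    # Apply each pattern in turn, one pass per replacement key.
    for pat, key in grouped.items():
        text = re.sub(pat, key, text)
    return text


texts = [
    "haha too long didn't read and laught out loud :D",
    "laught-out loud so I couldnt too long; did not read",
    "what happened?",
]
result = [apply_grouped(t) for t in texts]
print(result)
```

The same `grouped` dictionary should also be usable with `input_df["text"].replace(grouped, regex=True)`, as in the snippet above.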
Upvotes: 1
Reputation: 3926
Here is how I would do it:
import pandas as pd
dct = {"LOL": ["laught out loud", "laught-out loud"],
"TLDR": ["too long didn't read", "too long; did not read"]
}
input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
("laught-out loud so I couldnt too long; did not read")], columns=['text'])
dct_inv = {}
for key, vals in dct.items():
    for val in vals:
        dct_inv[val] = key
dct_inv
def replace_text(input_str):
    for key, val in dct_inv.items():
        input_str = str(input_str).replace(key, val)
    return input_str
# Apply to the column itself; DataFrame.apply(axis=1) would pass a whole row Series.
input_df["text"].apply(replace_text).to_frame()
Upvotes: 1