Reputation: 8844

How to replace group of strings in pandas series based on a dictionary with values as list?

I couldnt find a solution in stackoverflow for replacing based on dictionary where the values are in a list.

Dictionary

dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"],
        "application": ["app"]}

Input

input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
                         ("laught-out loud so I couldnt too long; did not read"),
                         ("what happened?")], columns=['text'])

Expected output

output_df = pd.DataFrame([("haha TLDR and LOL :D"),
                          ("LOL so I couldnt TLDR"),
                          ("what happened?")], columns=['text'])

Edit

Added a additional entry to the dictionary i.e. "application": ["app"]

The current solutions are giving output as "what happlicationened?"

Please suggest a fix.

Upvotes: 3

Answers (4)

cs95

Reputation: 402483

Build an inverted mapping and use Series.replace with regex=True.

mapping = {v : k for k, V in dct.items() for v in V}
input_df['text'] = input_df['text'].replace(mapping, regex=True)

print(input_df)
                    text
0   haha TLDR and LOL :D
1  LOL so I couldnt TLDR

Where,

print(mapping)
{'laught out loud': 'LOL',
 'laught-out loud': 'LOL',
 "too long didn't read": 'TLDR',
 'too long; did not read': 'TLDR'}

To match full words, add word boundaries to each word:

mapping = {rf'\b{v}\b' : k for k, V in dct.items() for v in V}
input_df['text'] = input_df['text'].replace(mapping, regex=True)

print(input_df)
                    text
0   haha TLDR and LOL :D
1  LOL so I couldnt TLDR
2         what happened?

Where,

print(mapping)
{'\\bapp\\b': 'application',
 '\\blaught out loud\\b': 'LOL',
 '\\blaught-out loud\\b': 'LOL',
 "\\btoo long didn't read\\b": 'TLDR',
 '\\btoo long; did not read\\b': 'TLDR'}

Upvotes: 6

Sven Harris

Reputation: 2939

I think the most logical place to start is to reverse your dictionary so your key is your original string which maps to the value of your new string. You can either do that by hand or a million other ways like:

import itertools
dict_rev = dict(itertools.chain.from_iterable([list(zip(v, [k]*len(v))) for k, v in dct.items()]))

Which isn't super readable. Or this one which looks better and I stole from another answer:

dict_rev = {v : k for k, V in dct.items() for v in V}

This requires that each of the values in your dictionary is within a list (or other iterable) e.g. "new key": ["single_val"] otherwise it will explode each character in the string.

You can then do the following (based on the code here How to replace multiple substrings of a string?)

import re
rep = dict((re.escape(k), v) for k, v in dict_rev.items())
pattern = re.compile("|".join(rep.keys()))
input_df["text"] = input_df["text"].str.replace(pattern, lambda m: rep[re.escape(m.group(0))])

This method performs roughly 3 times faster than the simpler more elegant solution:

Simple:

%timeit input_df["text"].replace(dict_rev, regex=True)

425 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Faster:

%timeit input_df["text"].str.replace(pattern, lambda m: rep[re.escape(m.group(0))])

160 µs ± 7.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Upvotes: 1

Rakesh

Reputation: 82765

Using df.apply and a custom function

Ex:

import pandas as pd


def custReplace(value):
    dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"]
        }

    for k, v in dct.items():
        for i in v:
            if i in value:
                value = value.replace(i, k)
    return value

input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
       ("laught-out loud so I couldnt too long; did not read")], columns=['text'])

print(input_df["text"].apply(custReplace))

Output:

0     haha TLDR and LOL :D
1    LOL so I couldnt TLDR
Name: text, dtype: object

dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"]
        }

dct = { "(" + "|".join(v) + ")": k for k, v in dct.items()}
input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
       ("laught-out loud so I couldnt too long; did not read")], columns=['text'])

print(input_df["text"].replace(dct, regex=True))

Upvotes: 1

quest

Reputation: 3926

Here is how i will go:

import pandas as pd


dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"]
        }

input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
       ("laught-out loud so I couldnt too long; did not read")], columns=['text'])

dct_inv = {}
for key, vals in dct.items():
    for val in vals:
        dct_inv[val]=key

dct_inv

def replace_text(input_str):
    for key, val in dct_inv.items():
        input_str = str(input_str).replace(key, val)
    return input_str

input_df.apply(replace_text, axis=1).to_frame()

Upvotes: 1

How to replace group of strings in pandas series based on a dictionary with values as list?

Answers (4)

Related Questions