Guguma
Guguma

Reputation: 93

Fastest method to replace substring from a Dataframe from key value pairs

Assume that I have a DataFrame with dates in European Languages:

In[0]: df_date
Out[0]:
    date
0   ene 1 2021
1   1 ene 2021
2   mar. 1 2021
3   5 mars. 2021
4   10 abr 2021
5   abr. 1 2021

And I have another DataFrame with corresponding string mappings

In[0]: df_translate
Out[0]:
    foreign eng
0   ene     jan
1   feb     feb
2   mar     mar
3   abr     apr
4   may     may
5   mars    mar

The way I go about replacing these is usually with converting the mapping DataFrame to a dictionary, and then use Series.str.replace()

date_replace_dict = df_translate.set_index('foreign',drop=True).to_dict()['eng']
    
for k,v in date_replace_dict.items():

    df_date['date'] = df['date'].str.replace(k,v,regex=False)

I have found this process to be a bit slow, I was wondering if there is a better/faster more pythonic way of doing it.

I cannot use the Series.replace() method because apparently it requires an exact match on the string.

Also please note that, this is a more general question, in the sense that I happen to do this operation a lot on different types of columns containing different types of strings and substrings. So the substrings to be replaced are not necessarily at the start of the string or in any other algorithmically determinable place.

If this is a good way of going about it, then it is fine. But python tends to make me feel like I am doing something very wrong whenever I use a loop for achieveing something :)

Upvotes: 2

Views: 306

Answers (3)

creanion
creanion

Reputation: 2743

I would consider writing it as just one regex substitution like this, to avoid copying the whole Series multiple times. It matters more, the more substitutions you need to apply on each column, of course.

import re

pattern = r"\b(" + "|".join(re.escape(k) for k in date_replace_dict) + r")\b"

df['date'].str.replace(pattern, lambda m: date_replace_dict[m.group(0)], regex=True)

Here, using word boundary \b matchers since it might be required if the words to replace are themselves substrings of each other, but it depends on use case how these should be applied.

Upvotes: 1

Алексей Р
Алексей Р

Reputation: 7627

import pandas as pd

df_date = pd.DataFrame({'date':
                            ['ene 1 2021',
                             '1 ene 2021',
                             'mar. 1 2021',
                             '5 abr. 2021',
                             '10 abr 2021',
                             'abr. 1 2021']})

df_translate = pd.DataFrame(
    {'foreign': ['ene', 'feb', 'mar', 'abr', 'mar', 'jun'], 'eng': ['jan', 'feb', 'mar', 'apr', 'may', 'jun']})
d = df_translate.set_index('foreign', drop=True).to_dict()['eng']
print(df_date['date'].replace(to_replace=d, value=None, regex=True))

Output

0     jan 1 2021
1     1 jan 2021
2    may. 1 2021
3    5 apr. 2021
4    10 apr 2021
5    apr. 1 2021
Name: date, dtype: object

Upvotes: 1

Nico Bako
Nico Bako

Reputation: 119

Take a look at dateparser. Dateparser will do all the work for you in converting the string to a datetime object. Especially in your case, when all thestrings are formatted differently, and there are multiple languages, then using dateparser would proabably be the easiest thing to do.

>>> import dateparser as dp
>>> dp.parse("10 Jan 2020")
    datetime.datetime(2020, 1, 10, 0, 0)
>>> dp.parse("ene 1 2021")
    datetime.datetime(2021, 1, 1, 0, 0)
>>> dp.parse("abr. 1 2021")
    datetime.datetime(2021, 4, 1, 0, 0)
>>> dp.parse("10 abr 2021")
    datetime.datetime(2021, 4, 10, 0, 0)
>>> dp.parse("2020 dic 25")
    datetime.datetime(2020, 12, 25, 0, 0)
>>> dp.parse("mart 25 14") # mart is dutch for march
    datetime.datetime(2014, 3, 25, 0, 0)

You could just apply the dateparser.parse() function to your column of string dates and you'd be done!

In other words, I would completely get rid of your mapping and try dateparser instead.

Here is a list of all of the languages and locales that they support.

Upvotes: 1

Related Questions