Reputation: 93
Assume that I have a DataFrame with dates in European Languages:
In[0]: df_date
Out[0]:
date
0 ene 1 2021
1 1 ene 2021
2 mar. 1 2021
3 5 mars. 2021
4 10 abr 2021
5 abr. 1 2021
And I have another DataFrame with corresponding string mappings
In[0]: df_translate
Out[0]:
foreign eng
0 ene jan
1 feb feb
2 mar mar
3 abr apr
4 may may
5 mars mar
The way I go about replacing these is usually with converting the mapping DataFrame to a dictionary, and then use Series.str.replace()
date_replace_dict = df_translate.set_index('foreign',drop=True).to_dict()['eng']
for k,v in date_replace_dict.items():
df_date['date'] = df['date'].str.replace(k,v,regex=False)
I have found this process to be a bit slow, I was wondering if there is a better/faster more pythonic way of doing it.
I cannot use the Series.replace()
method because apparently it requires an exact match on the string.
Also please note that, this is a more general question, in the sense that I happen to do this operation a lot on different types of columns containing different types of strings and substrings. So the substrings to be replaced are not necessarily at the start of the string or in any other algorithmically determinable place.
If this is a good way of going about it, then it is fine. But python tends to make me feel like I am doing something very wrong whenever I use a loop for achieveing something :)
Upvotes: 2
Views: 306
Reputation: 2743
I would consider writing it as just one regex substitution like this, to avoid copying the whole Series multiple times. It matters more, the more substitutions you need to apply on each column, of course.
import re
pattern = r"\b(" + "|".join(re.escape(k) for k in date_replace_dict) + r")\b"
df['date'].str.replace(pattern, lambda m: date_replace_dict[m.group(0)], regex=True)
Here, using word boundary \b
matchers since it might be required if the words to replace are themselves substrings of each other, but it depends on use case how these should be applied.
Upvotes: 1
Reputation: 7627
import pandas as pd
df_date = pd.DataFrame({'date':
['ene 1 2021',
'1 ene 2021',
'mar. 1 2021',
'5 abr. 2021',
'10 abr 2021',
'abr. 1 2021']})
df_translate = pd.DataFrame(
{'foreign': ['ene', 'feb', 'mar', 'abr', 'mar', 'jun'], 'eng': ['jan', 'feb', 'mar', 'apr', 'may', 'jun']})
d = df_translate.set_index('foreign', drop=True).to_dict()['eng']
print(df_date['date'].replace(to_replace=d, value=None, regex=True))
Output
0 jan 1 2021
1 1 jan 2021
2 may. 1 2021
3 5 apr. 2021
4 10 apr 2021
5 apr. 1 2021
Name: date, dtype: object
Upvotes: 1
Reputation: 119
Take a look at dateparser. Dateparser will do all the work for you in converting the string to a datetime object. Especially in your case, when all thestrings are formatted differently, and there are multiple languages, then using dateparser would proabably be the easiest thing to do.
>>> import dateparser as dp
>>> dp.parse("10 Jan 2020")
datetime.datetime(2020, 1, 10, 0, 0)
>>> dp.parse("ene 1 2021")
datetime.datetime(2021, 1, 1, 0, 0)
>>> dp.parse("abr. 1 2021")
datetime.datetime(2021, 4, 1, 0, 0)
>>> dp.parse("10 abr 2021")
datetime.datetime(2021, 4, 10, 0, 0)
>>> dp.parse("2020 dic 25")
datetime.datetime(2020, 12, 25, 0, 0)
>>> dp.parse("mart 25 14") # mart is dutch for march
datetime.datetime(2014, 3, 25, 0, 0)
You could just apply the dateparser.parse()
function to your column of string dates and you'd be done!
In other words, I would completely get rid of your mapping and try dateparser instead.
Here is a list of all of the languages and locales that they support.
Upvotes: 1