Reputation: 3987
I have a dataframe like this:
p q
0 jdkdn 01JAN2020
1 01JAN2020 hdk789dj
2 783i3 01FEB2020
I am trying to replace 01JAN2020
by 01-01-2020
, basically alphabetical month name
to numeric month number
. I don't know where these date
are, mean in any column and any row. I have tried to do it by just using df.replace
but having problem with raw string and I want to use f-string as well.
Here's my try:
import pandas as pd
import re
df=pd.DataFrame({"p":["jdkdn","01JAN2020","783i3"],"q":["01JAN2020","hdk789dj","01FEB2020"]})
months = {'JAN': '01', 'FEB': '02'}
#All 12 months but for test I am using just 2
df.replace(r'(\d{2})('+'|'.join(months)+')(\d{4})',r"\1-"+
f"{{months[\\2]}}" # Here I am trying to do main things
+r"-\3",regex=True,inplace=True)
print(df)
"""
p q
0 jdkdn 01-{months[JAN]}-2020
1 01-{months[JAN]}-2020 hdk789dj
2 783i3 01-{months[FEB]}-2020
"""
fr"{{months[\\2]}}"
"""
p q
0 jdkdn 01-{months[\2]}-2020
1 01-{months[\2]}-2020 hdk789dj
2 783i3 01-{months[\2]}-2020
"""
rf"{months[\2]}" and rf"{months[\\2]}"
"""
SyntaxError: f-string expression part cannot include a backslash
"""
Idea:
Can we use regex if
like (?(<month name>)<number>)
?, Any function in pandas that will detect any 01JAN2020
replace that with 01-01-2020
.
Expected output is:
p q
0 jdkdn 01-01-2020
1 01-01-2020 hdk789dj
2 783i3 01-02-2020
Upvotes: 1
Views: 239
Reputation: 28709
You can use pandas str. replace, simply create a pattern with names, and the replacements, and use transform
to apply it to all the columns:
pat = r"(?P<day>\d+)(?P<month>[A-Z]+)(?P<year>\d+)"
repl = lambda m: f"{m.group('day')}-{months[m.group('month')]}-{m.group('year')}"
df.transform(lambda x: x.str.replace(pat, repl, regex=True))
p q
0 jdkdn 01-01-2020
1 01-01-2020 hdk789dj
2 783i3 01-02-2020
Note that for strings, if you want more performance, you could write it within python and apply. You can use re.sub
for that, it should give noticeable improvements (pls test):
df.applymap(lambda x: re.sub(pat, repl, x))
If your data is duplicated, an efficient approach would be to convert to a Categorical and apply the string modification to the categories; that should be faster.
Kindly note that nulls are covered implicitly by pandas string functions, within python, your code should cover these possible scenarios.
Upvotes: 3