Reputation: 173
I have a column of strings/Nones that I'd like to convert to date. But, if there's an element that I can't convert to date (in my format), I'd like None there, but I also need to know it failed to convert(i.e. it's not None because it was there as None originally, but was incorrect entry to start with)
I could, of course, return say an error string in the conversion function, like:
import pandas as pd
import datetime
def convert(x):
if x is None:
return None
try:
d = datetime.datetime.strptime(x, '%Y%m%d')
except:
d = f"Error converting {x}"
return d
s1 = pd.Series(["2021228", "2021228", "2021228", None, "x"])
s2 = s1.apply(convert)
print(s2)
0 2021-02-28 00:00:00
1 2021-02-28 00:00:00
2 2021-02-28 00:00:00
3 None
4 Error converting x
dtype: object
But then I need to check for all the elements of the column that are error string, and, because I need them listed, would need to separate them into their own column/series.
Which TBH seems like a major pain to me, as it means I'd have to run map/apply or something similar again, so two passes for each column I need to verify.
Edit: And, in a more generic way, I don't like checking for a specific string in the first place. If it was an operation on strings, not dates, it'd be rather unreliable.
I have tried to return a Series from the apply, with labels value and error, but if there is an error (i.e. a None value in the value label) , all datetimes get converted into a float, and I can't figure out why (I looked into pandas' code, it marks it as mixed type, and I suspect does some conversion somewhere).
import pandas as pd
import datetime
def convert(x):
v = None
e = None
if x is not None:
try:
v = datetime.datetime.strptime(x, '%Y%m%d')
except:
e = f"Error converting {x}"
return pd.Series({"value":v, "error":e})
s1 = pd.Series(["2021228", "2021228", "2021228", None, "x"])
s2 = s1.apply(convert)
print(s2["value"])
0 1.614470e+18
1 1.614470e+18
2 1.614470e+18
3 NaN
4 NaN
Any ideas?
Upvotes: 0
Views: 51
Reputation: 173
Actually, the easiest way seem to be to use transform(), with two functions, one which does the transformation and the other produces the error. I don't know whether it's the most efficient, but it works ok on the relatively small datasets I have.
something like
def value(x):
return convert(x) if valid(x) else None
def error(x):
return None if valid(x) else f"{x} is causing an error!"
s = s1.transform([value, error])
s["value"] then has the right values (or Nones), s["error"] has the errors.
Upvotes: 1