Reputation: 2889
I have a large dataframe containing, amongst other things, a (Norwegian) social security number. It is possible to get the date of birth out of this number via a special algorithm. However, every now and then an illegal social security number creeps into the database corrupting the calculation.
What I would like to do is to tag every line having an illegal social security number, along with a log message showing the error raised.
Consider the following, constructed, example
import pandas as pd
from datetime import date
sample_data = pd.DataFrame({'id' : [1, 2, 3], \
'sec_num' : [19790116, 19480631, 19861220]})
# The actual algorithm transforming the sec number is more complicated
# this is just for illustration purposes
def int2date(argdate: int):
try:
year = int(argdate / 10000)
month = int((argdate % 10000) / 100)
day = int(argdate % 100)
return date(year, month, day)
except ValueError:
raise ValueError("Value:{0} not a legal date.".format(argdate))
I want to create the following output
id sec_num date_of_birth is_in_error error_msg
0 1 19790116 1979-01-16 False
1 2 19480631 None True 19480631 is not a legal date
2 3 19861220 1986-12-20 False
I have tried
try:
sample_data['date_of_birth'] = [int2date(sec_num) for \
sec_num in sample_data['sec_num']]
sample_data['is_in_error'] = False
sample_data['error_msg'] = ''
except ValueError as e:
sample_data['is_in_error'] = True
sample_data['error_msg'] = str(e)
but this produces this
id sec_num is_in_error error_msg
0 1 19790116 True Value:19480631 not a legal date.
1 2 19480631 True Value:19480631 not a legal date.
2 3 19861220 True Value:19480631 not a legal date.
I guess the problem is that I assign the date_of_birth
-column in one operation and the errors in another. I am not sure how to catch and create the is_in_error
and error_msg
columns simulationously.
Upvotes: 8
Views: 372
Reputation: 3738
This happens because of the way you're populating the dataframe.
sample_data['error_msg'] = str(e)
Will actually overwrite the entire column with str(e)
.
This is probably the most efficient way to do it:
def int2date(argdate: int):
try:
year = int(argdate / 10000)
month = int((argdate % 10000) / 100)
day = int(argdate % 100)
return date(year, month, day)
except ValueError as e:
pass # you could write the row and the error to your logs here
df['date_of_birth'] = df.sec_num.apply(int2date)
df['is_in_error'] = df.date_of_birth.isnull()
However if you also want to write the errors to the dataframe, you can use this approach although it might be much slower (there might be faster solutions to this).
df['date_of_birth'] = None
df['error_msg'] = None
df['is_in_error'] = False
for i, row in df.iterrows():
try:
date_of_birth = int2date(row['sec_num'])
df.set_value(i, 'date_of_birth', date_of_birth)
except ValueError as e:
df.set_value(i, 'is_in_error', True)
df.set_value(i, 'error_msg', str(e))
This handles each row separately and will only write the error to the correct index instead of updating the entire column.
Upvotes: 4
Reputation: 59506
You are in the realm of handling large data. Throwing exceptions out of a loop is often not the best idea there because it will normally abort the loop. As many others you do not seem to want that.
To achieve that a typical approach is to use a function which does not throw the exception but which returns it instead.
def int2date(argdate: int):
try:
year = int(argdate / 10000)
month = int((argdate % 10000) / 100)
day = int(argdate % 100)
return date(year, month, day)
except ValueError:
return ValueError("Value:{0} not a legal date.".format(argdate))
With this you simply can map a list of values to the function and will receive the exceptions (which lack a trace of course, but in such a case this should not be a problem) as values in the result list:
You then can walk over the list, replace the found exceptions by None
values and fill other columns instead with the message contained in the exception.
Upvotes: 2