Reputation: 795
Looking to validate the data in a csv file using Python and Pandas. Everything works fine when feeding clean data. However when there is a problem with the data, the issues are hard to find. Raising any sort of error would be great. Here is some pseudo code:
dtypes = {'Date': 'str', 'yesno': 'str', 'int_val': 'int', 'decimal_value': 'str'}
df = pd.read_csv(filename, dtype=dtypes)
# Ensure exceptions are thrown for invalid data.
# valid date format in date. ValueError raised for invalid data.
pd.to_datetime(df['Date'])
# 'yes' or 'no' for the yesno field. (has to be in a list of values)
# valid integer for int_val.
# a valid integer or decimal number for decimal_value
I am not even sure the pd.to_datetime is the best way to validate a date. What is a good way of doing this?
Upvotes: 5
Views: 8586
Reputation: 3130
'yes' or 'no' for the yesno field. (has to be in a list of values):
df.yesno.isin(['yes','no']).all() # Returns False if there are any other values
valid integer for int_val:
df.int_val.astype(int) # Throws an error if there are non-integers
# or, if int_val are floats:
import numpy as np
np.isclose(df.int_val.round(0),df.int_val.astype(int)).all()
a valid integer or decimal number for decimal_value:
df.decimal_value.astype(float) # similar to above
Using pd.to_datetime()
to validate dates is probably best; you can also specify the format of the dates if necessary, e.g. with the keyword argument format = '%y-%m-%d'
which expects dates to be of the form yyyy-mm-dd
.
Upvotes: 4