Reputation: 314
I am using Jupyter Notebook with pandas and Python. I have a DataFrame of shape (10500, 4), i.e. 10500 records and 4 columns. Initial_Date is one of the 4 columns and has the object dtype. This is the kind of data it contains:
Initial_Date
1971
11969
102006
03051992
00131954
27001973
45061987
1996
The intended format of the column is DDMMYYYY (03051992 is 3rd May 1992).
Note: as you can see, there are invalid MM values (00 and 13) and invalid DD values (00 and 45).
I would like to use regex to extract whatever is available in the field. I don't know how to read YYYY separately from MM or DD, so please enlighten me here. After the extraction, I would like to test whether YYYY, DD and MM are valid. If any of them is invalid, assign NaT; otherwise produce DD-MM-YYYY or DD/MM/YYYY (I'm not fussy about the final format).
For example, 051992 is considered invalid because the day is missing: it becomes DD/05/1992.
A field that has all 8 digits, for example 10081996, is considered valid: 10/08/1996.
PS. I am just starting out with pandas and Jupyter Notebook and slowly reviving my Python skills. FYI, if you think there is a better way to convert each field to a valid date datatype, please do enlighten me.
Upvotes: 1
Views: 880
Reputation: 210852
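You can do it this way:
import pandas as pd
# First pass: let pandas infer the format, reading the day first
result = pd.to_datetime(d.Initial_Date.astype(str), dayfirst=True, errors='coerce')
# Second pass: retry the rows that failed with the explicit format %d%m%Y (.loc replaces the removed .ix indexer)
result.loc[result.isnull()] = pd.to_datetime(d.Initial_Date.astype(str), format='%d%m%Y', dayfirst=True, errors='coerce')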
Result:
In [88]: result
Out[88]:
0 1971-01-01
1 NaT
2 2006-10-20
3 1992-03-05
4 1954-01-03
5 NaT
6 NaT
7 1996-01-01
Name: Initial_Date, dtype: datetime64[ns]
Original DataFrame:
In [89]: d
Out[89]:
Initial_Date
0 1971
1 11969
2 102006
3 3051992
4 131954
5 27001973
6 45061987
7 1996
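Note that the two-pass approach above is lenient: it also turns shorter values such as 102006 into dates. If only full 8-digit values should count as valid, as the question describes, a stricter variant could look like the minimal sketch below. It assumes the column is stored as strings so that leading zeros are preserved; the sample frame and the names parts, strict, formatted and the regex group names day, month and year are illustrative, not part of the original data or answer.

```python
import pandas as pd

# Sample values from the question, kept as strings so leading zeros survive.
d = pd.DataFrame({"Initial_Date": ["1971", "11969", "102006", "03051992",
                                   "00131954", "27001973", "45061987", "1996"]})

s = d["Initial_Date"].astype(str).str.strip()

# Split DD, MM and YYYY with a regex; rows that are not exactly 8 digits give NaN.
parts = s.str.extract(r"^(?P<day>\d{2})(?P<month>\d{2})(?P<year>\d{4})$")

# Keep only the full 8-digit strings, then let to_datetime reject
# impossible days or months (00, 13, 45, ...) by coercing them to NaT.
strict = pd.to_datetime(s.where(parts["year"].notna()),
                        format="%d%m%Y", errors="coerce")

# Optional: render the valid dates as DD/MM/YYYY strings (invalid rows stay missing).
formatted = strict.dt.strftime("%d/%m/%Y")
```

With the sample values, only 03051992 survives (as 3rd May 1992); every other row becomes NaT. The parts frame is there in case the day, month and year are needed as separate columns for custom checks.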
Upvotes: 1