Reputation: 1614
I have a situation where month and date are messed up for few dates in my dataframe. For e.g here is the input:
df['work_date'].head(15)
0 2018-01-01
1 2018-02-01
2 2018-03-01
3 2018-04-01
4 2018-05-01
5 2018-06-01
6 2018-07-01
7 2018-08-01
8 2018-09-01
9 2018-10-01
10 2018-11-01
11 2018-12-01
12 2018-01-13
13 2018-01-14
14 2018-01-15
The date is stored as a string
. As you can see, the date is in the format yyyy-dd-mm
till 12th of Jan and then becomes yyyy-mm-dd
. The dataframe consists of 3 years worth data and this pattern repeats for all months for all years.
My expected output is to standardize the date to format dddd-mm-yy
like below.
0 2018-01-01
1 2018-01-02
2 2018-01-03
3 2018-01-04
4 2018-01-05
5 2018-01-06
6 2018-01-07
7 2018-01-08
8 2018-01-09
9 2018-01-10
10 2018-01-11
11 2018-01-12
12 2018-01-13
13 2018-01-14
14 2018-01-15
Below is the code that I wrote and it gets the job done. Basically, I split the date string and do some string manipulations. However, as you can see its not too pretty. I am checking to see if there could be some other elegant solution to this other than doing the df.apply
and the loops
.
def func(x):
d = x.split('-')
print(d)
if (int(d[1]) <= 12) & (int(d[2]) <= 12) :
d = [d[0],d[2],d[1]]
x = '-'.join(d)
return x
else:
return x
df['work_date'] = df['work_date'].apply(lambda x:func(x))
Upvotes: 1
Views: 223
Reputation: 16683
You could just update the column based on the fact that it is in order and there is only one date and all days of the year are included consecutively:
df['Date'] = pd.date_range(df['work_date'].min(), '2018-01-12', freq='1D')
# you can specify df['work_date'].min() OR df['work_date'].max) OR A STRING. It really depends on what format your minimum and your maximum is
df
Out[1]:
work_date date
0 2018-01-01 2018-01-01
1 2018-02-01 2018-01-02
2 2018-03-01 2018-01-03
3 2018-04-01 2018-01-04
4 2018-05-01 2018-01-05
5 2018-06-01 2018-01-06
6 2018-07-01 2018-01-07
7 2018-08-01 2018-01-08
8 2018-09-01 2018-01-09
9 2018-10-01 2018-01-10
10 2018-11-01 2018-01-11
11 2018-12-01 2018-01-12
12 2018-01-13 2018-01-13
13 2018-01-14 2018-01-14
14 2018-01-15 2018-01-15
To make this more dynamic, you could also do some try
/ except
shown below:
minn = df['work_date'].min()
maxx = df['work_date'].max()
try:
df['Date'] = pd.date_range(minn, maxx, freq='1D')
except ValueError:
s = maxx.split('-')
df['Date'] = pd.date_range(minn, f'{s[0]}-{s[2]}-{s[1]}', freq='1D')
except ValueError:
s = minn.split('-')
df['Date'] = pd.date_range(f'{s[0]}-{s[2]}-{s[1]}', maxx, freq='1D')
df
Upvotes: 2