giga
giga

Reputation: 327

Handle multiple date formats in pandas dataframe

I have a dataframe (imported from Excel) which looks like this:

         Date               Period  
0  2017-03-02  2017-03-01 00:00:00  
1  2017-03-02  2017-04-01 00:00:00     
2  2017-03-02  2017-05-01 00:00:00    
3  2017-03-02  2017-06-01 00:00:00    
4  2017-03-02  2017-07-01 00:00:00      
5  2017-03-02  2017-08-01 00:00:00   
6  2017-03-02  2017-09-01 00:00:00    
7  2017-03-02  2017-10-01 00:00:00  
8  2017-03-02  2017-11-01 00:00:00 
9  2017-03-02  2017-12-01 00:00:00 
10 2017-03-02                 Q217 
11 2017-03-02                 Q317  
12 2017-03-02                 Q417 
13 2017-03-02                 Q118 
14 2017-03-02                 Q218 
15 2017-03-02                 Q318 
16 2017-03-02                 Q418 
17 2017-03-02                 2018   

I am trying to convert all the 'Period' column into a consistent format. Some elements look already in the datetime format, others are converted to string (ex. Q217), others to int (ex 2018). Which is the fastest way to convert everything in a datetime? I was trying with some masking, like this:

mask = df['Period'].str.startswith('Q', na = False)
list_quarter = df_final[mask]['Period'].tolist()
quarter_convert = {'1':'31/03', '2':'30/06', '3':'31/08', '4':'30/12'}
counter = 0
for element in list_quarter:
    element = element[1:]
    quarter = element[0]
    year = element[1:]
    daymonth = ''.join(str(quarter_convert.get(word, word)) for word in quarter)
    final = daymonth+'/'+year
    list_quarter[counter] = final
    counter+=1

However it fails when I try to substitute the modified elements in the original column:

df_nwe_final['Period'] = np.where(mask, pd.Series(list_quarter), df_nwe_final['Period'])

Of course I would need to do more or less the same with the 2018 type formats. However, I am sure I am missing something here, and there should be a much faster solution. Some fresh ideas from you would help! Thank you.

Upvotes: 0

Views: 1055

Answers (1)

joris
joris

Reputation: 139312

Reusing the code you show, let's first write a function that converts the Q-string to a datetime format (I adjusted to final format a little bit):

def convert_q_string(element):
    quarter_convert = {'1':'03-31', '2':'06-30', '3':'08-31', '4':'12-30'}
    element = element[1:]
    quarter = element[0]
    year = element[1:]
    daymonth = ''.join(str(quarter_convert.get(word, word)) for word in quarter)
    final = '20' + year + '-' + daymonth
    return final

We can now use this to first convert all 'Q'-strings, and then pd.to_datetime to convert all elements to proper datetime values:

In [2]: s = pd.Series(['2017-03-01 00:00:00', 'Q217', '2018'])

In [3]: mask = s.str.startswith('Q')

In [4]: s[mask] = s[mask].map(convert_q_string)

In [5]: s
Out[5]: 
0    2017-03-01 00:00:00
1             2017-06-30
2                   2018
dtype: object

In [6]: pd.to_datetime(s)
Out[6]: 
0   2017-03-01
1   2017-06-30
2   2018-01-01
dtype: datetime64[ns]

Upvotes: 1

Related Questions