Siraj S.

Reputation: 3751

numpy convert string to datetime when only year information is available

Consider the input below:

[['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']]   

My desired output is:

[[2016-12-31, 2015-12-31, 2014-12-31],
 [2016-03-31, 2015-03-31, 2014-03-31]]

Basically, I want to convert elements 1-3 of each nested list into datetime objects, taking the month and day from element 0 of that list.

I can think of a manual, labor-intensive solution, but I am looking for the fastest method to achieve this. The actual data has many thousands of such rows.

Upvotes: 1

Views: 194

Answers (2)

jezrael

Reputation: 862641

You can use str.extract to pull the month and day out of the first column, add that prefix to each year from the right with radd, and convert the result with to_datetime:

import numpy as np
import pandas as pd

L = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']]   

a = np.array(L)
pat = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2})'
d = pd.Series(a[:, 0]).str.extract(pat, expand=True).apply('-'.join, 1).add('-')
print (d)
0    Dec-31-
1    Mar-31-
dtype: object

L1 = pd.DataFrame(a[:, 1:]).radd(d, 0).apply(pd.to_datetime).values.astype('datetime64[D]')
print (L1)
[['2016-12-31' '2015-12-31' '2014-12-31']
 ['2016-03-31' '2015-03-31' '2014-03-31']]
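
To make the intermediate step easier to follow, here is a minimal sketch of what the pattern captures and how the year ends up on the right of the prefix. It uses only the standard `re` module rather than pandas, and the names `PAT` and `month_day_prefix` are made up for this illustration:

```python
import re

# Same pattern as above: a month abbreviation followed by a 1-2 digit day.
PAT = re.compile(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2})')

def month_day_prefix(label):
    """Return e.g. 'Dec-31-' for 'Fiscal data as of Dec 31 2016'."""
    m = PAT.search(label)
    if m is None:
        raise ValueError('no month/day found in %r' % label)
    return '-'.join(m.groups()) + '-'

L = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
     ['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']]

prefixes = [month_day_prefix(row[0]) for row in L]
# Put the prefix on the left of every year, mirroring what radd does.
combined = [[p + year for year in row[1:]] for p, row in zip(prefixes, L)]
print(prefixes)
print(combined[0])
```

The resulting strings ('Dec-31-2016' and so on) are the kind of input that then gets handed to the date parser.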

If performance is important, use a dictionary to map the month names:

d = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04', 'May':'05', 'Jun':'06', 
     'Jul':'07', 'Aug':'08', 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}

L2 = []
for l in L:
    a = l[0].split()[-3:-1]
    a = '-'.join([d[a[0]], a[1]])
    L2.append([x + '-' + a for x in l[1:]])

print (L2)

[['2016-12-31', '2015-12-31', '2014-12-31'],
 ['2016-03-31', '2015-03-31', '2014-03-31']]

Finally, if you need a NumPy array:

print (np.array(L2))
[['2016-12-31' '2015-12-31' '2014-12-31']
 ['2016-03-31' '2015-03-31' '2014-03-31']]
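
If actual datetime values are wanted rather than strings, one possible sketch is to build the ISO-formatted strings with the dictionary approach and let NumPy parse them through the dtype argument. The helper name `to_datetime64` is hypothetical, and the day is zero-padded here on the assumption that single-digit days can occur in the real data:

```python
import numpy as np

# Month abbreviation -> zero-padded month number.
MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

def to_datetime64(rows):
    """Build 'YYYY-MM-DD' strings per row and parse them as datetime64[D]."""
    out = []
    for row in rows:
        month, day = row[0].split()[-3:-1]
        suffix = '-' + MONTHS[month] + '-' + day.zfill(2)
        out.append([year + suffix for year in row[1:]])
    return np.array(out, dtype='datetime64[D]')

L = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
     ['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']]

arr = to_datetime64(L)
print(arr.dtype)  # datetime64[D]
print(arr)
```

This keeps the fast pure-Python string building and leaves the parsing to a single NumPy call.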

Timings:

L = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']] * 10000  


In [262]: %%timeit
     ...: d = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04', 'May':'05', 'Jun':'06', 
     ...:      'Jul':'07', 'Aug':'08', 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}
     ...: 
     ...: L2 = []
     ...: for l in L:
     ...:     a = l[0].split()[-3:-1]
     ...:     a = '-'.join([d.get(a[0]), a[1]])
     ...:     L2.append([x + '-' + a for x in l[1:]])
     ...: 
10 loops, best of 3: 44.3 ms per loop

In [263]: %%timeit
     ...: out_list=[]
     ...: for l in L:
     ...:     l_date = datetime.strptime((" ").join(l[0].split()[-3:]), '%b %d %Y')
     ...:     out_list.append([("-").join([str(l_year),str(l_date.month),str(l_date.day)])
     ...:             for l_year in l[-3:]])
     ...: 
1 loop, best of 3: 303 ms per loop

In [264]: %%timeit
     ...: a = np.array(L)
     ...: pat = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2})'
     ...: d = pd.Series(a[:, 0]).str.extract(pat, expand=True).apply('-'.join, 1).add('-')
     ...: L1 = pd.DataFrame(a[:, 1:]).radd(d, 0).apply(pd.to_datetime).values.astype('datetime64[D]')
     ...: 
1 loop, best of 3: 7.46 s per loop
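
The timings above come from IPython's %%timeit. As a rough sketch for reproducing the comparison in a plain script, the standard `timeit` module can be used instead; the function name `build` and the repeat counts are arbitrary choices for this example, and the absolute numbers will differ by machine:

```python
import timeit

MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

def build(rows):
    """Dictionary-based conversion, same logic as the loop timed above."""
    out = []
    for row in rows:
        month, day = row[0].split()[-3:-1]
        suffix = '-'.join([MONTHS[month], day])
        out.append([year + '-' + suffix for year in row[1:]])
    return out

rows = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
        ['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']] * 10000

# Total seconds for 5 calls, divided to get an average per call.
elapsed = timeit.timeit(lambda: build(rows), number=5) / 5
print('%.1f ms per loop' % (elapsed * 1000))
```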

Upvotes: 1

atru

Reputation: 4744

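This creates your desired output as a nested list of date strings (months and days are not zero-padded, so the second row reads '2016-3-31' and so on):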

from datetime import datetime

in_list = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']]

out_list = []
for l in in_list:
    l_date = datetime.strptime(" ".join(l[0].split()[-3:]), '%b %d %Y')
    out_list.append(["-".join([str(l_year), str(l_date.month), str(l_date.day)])
                     for l_year in l[-3:]])
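
If date objects are preferred over strings, a small variation is possible: parse the date once per row and swap in each year with `replace`. This is only a sketch; the function name `fiscal_dates` is made up, `%b` assumes English month abbreviations in the active locale, and `replace` would raise ValueError for Feb 29 combined with a non-leap year:

```python
from datetime import date, datetime

def fiscal_dates(rows):
    """Return one list of datetime.date objects per input row."""
    out = []
    for row in rows:
        # Parse the trailing 'Mon DD YYYY' part of the first element.
        base = datetime.strptime(' '.join(row[0].split()[-3:]), '%b %d %Y').date()
        # Keep month and day, substitute each listed year.
        out.append([base.replace(year=int(year)) for year in row[1:]])
    return out

in_list = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
           ['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']]

result = fiscal_dates(in_list)
print(result[1][2] == date(2014, 3, 31))  # True
```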

Upvotes: 0
