The Great
The Great

Reputation: 7723

Elegant and Efficient way to map dates from one dataframe to another - Big data

I have two dataframes as given below

t1 = pd.DataFrame({'person_id':[1,2,3],'observation_date':[np.nan,np.nan,np.nan],'observation_datetime':[np.nan,np.nan,np.nan]})

t2 = pd.DataFrame({'person_id':[1,2,3],'value_as_string':['5/28/2007','5/30/2007','6/4/2007']}).set_index('person_id')['value_as_string']

They look like as shown below

enter image description here

This is what I tried to get the output

t1['observation_date'] = t1['person_id'].map(t2)
t1['observation_date'] = pd.to_datetime(t1['observation_date'])
t1['observation_datetime'] = pd.to_datetime(t1['observation_date']).dt.strftime('%m/%d/%Y %H:%M:%S')

Though this works fine, it takes lot of time in real data

Please note that I am trying to do this on t1 dataframe of size 1 million records and t2 dataframe of size of 15k records. So any efficient approach would be helpful

I expect my output dataframe to look like as shown below

enter image description here

Upvotes: 4

Views: 1712

Answers (3)

Mohsin hasan
Mohsin hasan

Reputation: 837

Converting to datetime format takes a lot of time as well, you can speed that up by explicitly specifying datetime format as an argument to pd.to_datetime. It can give up to 10x improvement for you case.

Simulating your case.

import pandas as pd

t1 = pd.DataFrame({'person_id':[i for i in range(1000000)],'observation_date':[np.nan]*1000000,'observation_datetime':[np.nan]*1000000})
t2 = pd.DataFrame({'person_id':np.random.choice(1000000, replace=False, size=15000),
                   'value_as_string':['5/28/2007','5/30/2007','6/4/2007']*5000}).set_index('person_id')['value_as_string']


def map_infere_datetime_format(t1, t2):
  t1['observation_date'] = t1['person_id'].map(t2)
  t1['observation_date'] = pd.to_datetime(t1['observation_date'])
  t1['observation_datetime'] = pd.to_datetime(t1['observation_date']).dt.strftime('%m/%d/%Y %H:%M:%S')
  return t1

# explicitly specify format instead of pandas doing the work for you
def map_explicit_datetime_format(t1, t2):
  t1['observation_date'] = t1['person_id'].map(t2)
  t1['observation_date'] = pd.to_datetime(t1['observation_date'], format='%m/%d/%Y')
  t1['observation_datetime'] = t1['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')
  return t1

Test results running on google colab:

%%timeit -n3
map_infere_datetime_format(t1, t2)
# 3 loops, best of 3: 2.04 s per loop

%%timeit -n3
map_explicit_datetime_format(t1, t2)
# 3 loops, best of 3: 290 ms per loop

Since, t2 is small it would make sense to convert t2 to datetime before mapping for a small speedup.

Hope it helps!

Upvotes: 1

J. Doe
J. Doe

Reputation: 3634

ids = list(range(1, 15000))
dte = ['5/28/2007','5/30/2007','6/4/2007'] * 5000
t1 = pd.DataFrame({'person_id': ids})
t2 = pd.DataFrame({'person_id': ids, 
                   'value_as_string': dte)

Merge approach

x = t1.merge(t2, how='left', on='person_id', how='left')
# 5.19 ms ± 408 µs per loop

Join approach

x = t1.set_index('person_id').join(df2.set_index('person_id'), how='left') 
# 3.02 ms ± 91.4 µs per loop

Map approach with dict

t1['observation_date'] = t1['person_id'].map(
       t2.set_index('person_id')['value_as_string'].to_dict())
# 2.73 ms ± 240 µs per loop

Map approach without dict

t1['observation_date'] = t1['person_id'].map(t2.set_index('person_id')['value_as_string'])
# 2.33 ms ± 260 µs per loop

So

t1['observation_date'] = pd.to_datetime(
        t1['person_id'].map(t2.set_index('person_id')['value_as_string']))
t1['observation_datetime'] = t1['observation_date'].dt.strftime('%m/%d/%Y %H:%M:%S')

Upvotes: 4

Heathcliff
Heathcliff

Reputation: 11

I have a workaround for your problem. Instead of mapping, why don't you use a faster approach like a merge in pandas? I have used it on records ranging in close to a million and it is surprisingly fast.

The process of merging begins with two dataframes. Try doing

df =  t1.merge(t2, on = 'person_id', how='inner')

this would do an inner join on person_id on both columns in both dataframes(t1 and t2). You would get a new column introduced in resultant dataframe. You can then use simple column manipulation to fill the value in the target column.

Hope that helped.

Upvotes: 1

Related Questions