Reputation: 1241
I have a dataframe with over 41500 records and 3 fields: ID, start_date and end_date.
I want to create a separate dataframe from it with just 2 fields, ID and active_years, containing one record per identifier for every year between the start year and the end year (inclusive of the end year).
This is what I'm doing right now, but for 41500 rows it takes more than 2 hours to finish.
df = pd.DataFrame(columns=['id', 'active_years'])
ix = 0
for _, row in raw_dataset.iterrows():
    st_yr = int(row['start_date'].split('-')[0])  # because dates are in the format yyyy-mm-dd
    end_yr = int(row['end_date'].split('-')[0])
    for year in range(st_yr, end_yr+1):
        df.loc[ix, 'id'] = row['ID']
        df.loc[ix, 'active_years'] = year
        ix = ix + 1
So is there any faster way to achieve this?
[EDIT] Some example data to work with:
raw_dataset = pd.DataFrame({'ID':['a121','b142','cd3'],'start_date':['2019-10-09','2017-02-06','2012-12-05'],'end_date':['2020-01-30','2019-08-23','2016-06-18']})
print(raw_dataset)
     ID  start_date    end_date
0  a121  2019-10-09  2020-01-30
1  b142  2017-02-06  2019-08-23
2   cd3  2012-12-05  2016-06-18
# the desired dataframe should look like this
print(desired_df)
     id  active_years
0  a121          2019
1  a121          2020
2  b142          2017
3  b142          2018
4  b142          2019
5   cd3          2012
6   cd3          2013
7   cd3          2014
8   cd3          2015
9   cd3          2016
Upvotes: 0
Views: 646
Reputation: 8634
Dynamically growing Python lists is much faster than dynamically growing NumPy arrays (which are the underlying data structure of pandas dataframes), because enlarging an array generally means allocating new memory and copying the existing data. See here for a brief explanation. With that in mind:
import pandas as pd
# Initialize input dataframe
raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})
# Create integer columns for start year and end year
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year
# Iterate over input dataframe rows and individual years
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year+1):
        id_list.append(row.ID)
        active_years_list.append(year)
# Create result dataframe from lists
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})
print(desired_df)
# Output:
#      id  active_years
# 0  a121          2019
# 1  a121          2020
# 2  b142          2017
# 3  b142          2018
# 4  b142          2019
# 5   cd3          2012
# 6   cd3          2013
# 7   cd3          2014
# 8   cd3          2015
# 9   cd3          2016
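If the Python-level loop is still too slow, the same expansion can probably be done without any explicit loop. The following is only a sketch of a different technique (NumPy's `np.repeat` rather than list appends), not something I have benchmarked on the 41500-row data. The names `counts`, `offsets` and `vectorized_df` are my own, and the code assumes every end year is greater than or equal to its start year.

```python
import numpy as np
import pandas as pd

# Same example input as above
raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})

# Extract the years as plain NumPy integer arrays
start_year = pd.to_datetime(raw_dataset['start_date']).dt.year.to_numpy()
end_year = pd.to_datetime(raw_dataset['end_date']).dt.year.to_numpy()

# Number of output rows per input row (end year inclusive).
# Assumption: end_year >= start_year for every row, otherwise np.repeat fails.
counts = end_year - start_year + 1

# Position of each output row within its own block: 0, 1, ..., count-1
block_starts = np.cumsum(counts) - counts
offsets = np.arange(counts.sum()) - np.repeat(block_starts, counts)

# Repeat each ID and start year once per active year, then add the offset
vectorized_df = pd.DataFrame({
    'id': np.repeat(raw_dataset['ID'].to_numpy(), counts),
    'active_years': np.repeat(start_year, counts) + offsets,
})
print(vectorized_df)
```

On the three example rows this should give the same ten rows as the loop version above; whether it is noticeably faster on the real data would need to be timed.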
Upvotes: 2