Nick

Reputation: 3134

Efficiently adding rows to a dataframe

I have a dataset containing historical pricing data for a certain number of client IDs. Essentially, it is a spreadsheet with two columns as primary keys (id, p_date). Please find below a snippet (first five rows) of the dataset:

{'id': {0: '038af19179925da21a25619c5a24b745',
1: '038af19179925da21a25619c5a24b745',
2: '038af19179925da21a25619c5a24b745',
3: '038af19179925da21a25619c5a24b745',
4: '038af19179925da21a25619c5a24b745'},
'p_date': {0: '2015-01-01',
1: '2015-02-01',
2: '2015-03-01',
3: '2015-04-01',
4: '2015-05-01'},
'p_p1_fix': {0: 44.26693116,
1: 44.26693116,
2: 44.26693116,
3: 44.26693116,
4: 44.26693116},
'p_p1_var': {0: 0.151367,
1: 0.151367,
2: 0.151367,
3: 0.149626,
4: 0.149626},
'p_p2_fix': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'p_p2_var': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'p_p3_fix': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'p_p3_var': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}}
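(For reproducibility, the snippet can be loaded back into a dataframe; `snippet` below is just a stand-in name for the dict above.)

import pandas as pd

# `snippet` is assumed to be the dict printed above
hist = pd.DataFrame(snippet)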

There are up to 12 data points per client ID, corresponding to the 12 months of a year. I would like to use the individual pricing data points as separate features in a classification model, so I need to turn them into separate columns. How can I do this most efficiently? The spreadsheet contains data for 20K+ users. I have written the script below, which has been running for more than an hour and still hasn't completed:

import pandas as pd

def transform_hist(hist):
    dates = pd.Series(['2015-01-01', '2015-02-01', '2015-03-01',
                       '2015-04-01', '2015-05-01', '2015-06-01',
                       '2015-07-01', '2015-08-01', '2015-09-01',
                       '2015-10-01', '2015-11-01', '2015-12-01'])
    ids = hist['id'].unique()
    rows_list = []
    # scans the whole dataframe once per (id, month) combination
    for curr_id in ids:
        for curr_date in dates:
            temp = hist[(hist.id == curr_id) & (hist.p_date == curr_date)]
            if len(temp) > 0:
                rows_list.append({'id': curr_id,
                                  'p1_var_' + curr_date[5:7]: temp['p_p1_var'].iloc[0],
                                  'p2_var_' + curr_date[5:7]: temp['p_p2_var'].iloc[0],
                                  'p3_var_' + curr_date[5:7]: temp['p_p3_var'].iloc[0],
                                  'p1_fix_' + curr_date[5:7]: temp['p_p1_fix'].iloc[0],
                                  'p2_fix_' + curr_date[5:7]: temp['p_p2_fix'].iloc[0],
                                  'p3_fix_' + curr_date[5:7]: temp['p_p3_fix'].iloc[0]})
    df = pd.DataFrame(rows_list)
    # collapse the per-month partial rows into a single row per id
    flat_df = df.groupby(['id']).sum()
    return flat_df.reset_index()

The expected output would be a dataframe with 73 columns (id plus 12 × 6 columns, each storing one of the 6 prices for one of the 12 months) and one row per id (currently there are up to 12 rows per id). Below is an example of the expected output for the dataframe snippet above:

{'id': {0: '0002203ffbb812588b632b9e628cc38d'},
'p_p1_fix_01': {0: 40.56596939999999},
'p_p1_fix_02': {0: 40.56596939999999},
'p_p1_fix_03': {0: 40.728885},
'p_p1_fix_04': {0: 40.728885},
'p_p1_fix_05': {0: nan},
'p_p1_fix_06': {0: nan},
'p_p1_fix_07': {0: nan},
'p_p1_fix_08': {0: nan},
'p_p1_fix_09': {0: nan},
'p_p1_fix_10': {0: nan},
'p_p1_fix_11': {0: nan},
'p_p1_fix_12': {0: nan},
'p_p1_var_01': {0: 0.12609800000000002},
'p_p1_var_02': {0: 0.12609800000000002},
'p_p1_var_03': {0: 0.12806700000000001},
'p_p1_var_04': {0: 0.12806700000000001},
'p_p1_var_05': {0: nan},
'p_p1_var_06': {0: nan},
'p_p1_var_07': {0: nan},
'p_p1_var_08': {0: nan},
'p_p1_var_09': {0: nan},
'p_p1_var_10': {0: nan},
'p_p1_var_11': {0: nan},
'p_p1_var_12': {0: nan},
'p_p2_fix_01': {0: 24.33958068},
'p_p2_fix_02': {0: 24.33958068},
'p_p2_fix_03': {0: 24.43733004},
'p_p2_fix_04': {0: 24.43733004},
'p_p2_fix_05': {0: nan},
'p_p2_fix_06': {0: nan},
'p_p2_fix_07': {0: nan},
'p_p2_fix_08': {0: nan},
'p_p2_fix_09': {0: nan},
'p_p2_fix_10': {0: nan},
'p_p2_fix_11': {0: nan},
'p_p2_fix_12': {0: nan},
'p_p2_var_01': {0: 0.103975},
'p_p2_var_02': {0: 0.103975},
'p_p2_var_03': {0: 0.105842},
'p_p2_var_04': {0: 0.105842},
'p_p2_var_05': {0: nan},
'p_p2_var_06': {0: nan},
'p_p2_var_07': {0: nan},
'p_p2_var_08': {0: nan},
'p_p2_var_09': {0: nan},
'p_p2_var_10': {0: nan},
'p_p2_var_11': {0: nan},
'p_p2_var_12': {0: nan},
'p_p3_fix_01': {0: 16.22638872},
'p_p3_fix_02': {0: 16.22638872},
'p_p3_fix_03': {0: 16.29155496},
'p_p3_fix_04': {0: 16.29155496},
'p_p3_fix_05': {0: nan},
'p_p3_fix_06': {0: nan},
'p_p3_fix_07': {0: nan},
'p_p3_fix_08': {0: nan},
'p_p3_fix_09': {0: nan},
'p_p3_fix_10': {0: nan},
'p_p3_fix_11': {0: nan},
'p_p3_fix_12': {0: nan},
'p_p3_var_01': {0: 0.070232},
'p_p3_var_02': {0: 0.070232},
'p_p3_var_03': {0: 0.073773},
'p_p3_var_04': {0: 0.073773},
'p_p3_var_05': {0: nan},
'p_p3_var_06': {0: nan},
'p_p3_var_07': {0: nan},
'p_p3_var_08': {0: nan},
'p_p3_var_09': {0: nan},
'p_p3_var_10': {0: nan},
'p_p3_var_11': {0: nan},
'p_p3_var_12': {0: nan}}

How can I rewrite the script so that it does the job faster?

Upvotes: 1

Views: 149

Answers (2)

mcard

Reputation: 617

Luckily this can be done very simply:

df = df.pivot(index='id', columns='p_date')

This will reshape the df using hierarchical (MultiIndex) columns. To flatten the hierarchy levels into just one, you can join the level values:

df.columns = [' '.join(col).strip() for col in df.columns.values]

This sets each column name to a join of its labels across the hierarchical levels created in the previous step. Now you have one column for each combination of date and value column.
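For completeness, here is a minimal sketch of how this could look applied to the question's data, with the flattened names matching the p_p1_fix_01 style of the expected output (the month-number suffix is my addition, not part of the join idiom above):

import pandas as pd

# hist is the long-format dataframe from the question
hist['p_date'] = pd.to_datetime(hist['p_date'])

# pivot: one row per id, one MultiIndex column per (price column, date);
# this requires (id, p_date) pairs to be unique, which holds here since
# they form the primary key
wide = hist.pivot(index='id', columns='p_date')

# flatten the MultiIndex into names like 'p_p1_fix_01' (month as suffix)
wide.columns = ['{}_{:02d}'.format(col, date.month) for col, date in wide.columns]
wide = wide.reset_index()

Months with no data for an id come out as NaN, matching the expected output.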

Upvotes: 1

BENY

Reputation: 323226

If I understand correctly, you are looking for filter + groupby:

# dates is the Series of month-start strings from the question
hist['p_date'] = pd.to_datetime(hist['p_date'])
s = hist.loc[hist['p_date'].isin(pd.to_datetime(dates)), :]
s.groupby([s['id'], s['p_date'].dt.month]).sum().reset_index()
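Note this yields one row per (id, month) rather than one row per id. To reach the wide layout from the question, a possible follow-up (my sketch, not part of the answer above) is to unstack the month level:

# keep just the price columns; sum is effectively a pick here, since each
# (id, month) pair has at most one row
price_cols = ['p_p1_fix', 'p_p1_var', 'p_p2_fix',
              'p_p2_var', 'p_p3_fix', 'p_p3_var']
out = s.groupby([s['id'], s['p_date'].dt.month])[price_cols].sum()

# unstack months into columns, then flatten to names like 'p_p1_fix_01'
out = out.unstack('p_date')
out.columns = ['{}_{:02d}'.format(col, month) for col, month in out.columns]
out = out.reset_index()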

Upvotes: 0
