chrise
chrise

Reputation: 4253

pandas applying multicolumnindex to dataframe

The situation is that I have a few files with time_series data for various stocks with several fields. each file contains

time, open, high, low, close, volume

the goal is to get that all into one dataframe of the form

field      open                              high                            ...
security    hk_1      hk_2      hk_3 ...      hk_1      hk_2      hk_3 ...  ...
time
t_1      open_1_1  open_2_1  open_3_1 ...  high_1_1  high_2_1  high_3_1 ...  ...            
t_2      open_1_2  open_2_2  open_3_2 ...  high_1_2  high_2_2  high_3_2 ...  ...
...        ...        ...       ... ...       ...       ...       ... ...  ...

I created a multiindex

fields = ['time','open','high','low','close','volume','numEvents','value']
midx = pd.MultiIndex.from_product([security_name'], fields], names=['security', 'field'])

and for a start, tried to apply that multiindex to the dataframe I get from reading the data from csv (by creating a new dataframe and adding the index)

for c in eqty_names_list:

    midx = pd.MultiIndex.from_product([[c], fields], names=['security', 'field'])

    df_temp = pd.read_csv('{}{}.csv'.format(path, c))
    df_temp = pd.DataFrame(df_temp, columns=midx, index=df_temp['time'])
    df_temp.df_name = c
    all_dfs.append(df_temp)

However, the new dataframe only contains nan

security    1_HK
field       time    open    high    low     close   volume
time                                
 NaN         NaN     NaN     NaN    NaN       NaN      NaN

Also, it still contains a column for time, although I tried to make that the index (so that I can later join all the other dataframes for other stocks by index to get the aggregated dataframe).

How can I apply the multiindex to the dataframe without losing my data and then later join the dataframes looking like this

security    1_HK
field       time    open    high    low     close   volume
time

to create something like (note hierarchy field and security are switched)

field       time                open    high        ...
security    1_HK    2_HK ...    1_HK    2_HK ...    ...
time

Upvotes: 2

Views: 208

Answers (1)

jezrael
jezrael

Reputation: 863156

I think you can first get all files to list files, then with list comprehension get all DataFrames and concat them by columns (axis=1). If add parameter keys, you get Multiindex in columns:

Files:

a.csv, b.csv, c.csv

import pandas as pd
import glob

files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp) for fp in files]

eqty_names_list = ['hk1','hk2','hk3']
df = pd.concat(dfs, keys=eqty_names_list, axis=1)

print (df)
  hk1       hk2       hk3      
    a  b  c   a  b  c   a  b  c
0   0  1  2   0  9  6   0  7  1
1   1  5  8   1  6  4   1  3  2

Last need swaplevel and sort_index:

df.columns = df.columns.swaplevel(0,1)
df = df.sort_index(axis=1)
print (df)
    a           b           c        
  hk1 hk2 hk3 hk1 hk2 hk3 hk1 hk2 hk3
0   0   0   0   1   9   7   2   6   1
1   1   1   1   5   6   3   8   4   2

Upvotes: 2

Related Questions