xxyao
xxyao

Reputation: 549

Why all element are all NaN when construct a multiIndex Dataframe

suppose I have a Dataframe like this. I want to convert this to a 2-level multiIndex Dataframe.

         dt         st  close  volume
0   20100101  000001.sz      1   10000
1   20100101  000002.sz     10   50000
2   20100101  000003.sz      5    1000
3   20100101  000004.sz     15    7000
4   20100101  000005.sz    100  100000
5   20100102  000001.sz      2   20000
6   20100102  000002.sz     20   60000
7   20100102  000003.sz      6    2000
8   20100102  000004.sz     20    8000
9   20100102  000005.sz    110  110000

But when I try this code:

data = pd.read_csv('data/trial.csv')
print(data)
idx = pd.MultiIndex.from_product([data.dt.unique(),
                                  data.st.unique()],
                                 names=['dt', 'st'])
col = ['close', 'volume']

df = pd.DataFrame(data, idx, col)
print(df)

I find that all the element are NaN

                    close  volume
dt       st                      
20100101 000001.sz    NaN     NaN
         000002.sz    NaN     NaN
         000003.sz    NaN     NaN
         000004.sz    NaN     NaN
         000005.sz    NaN     NaN
20100102 000001.sz    NaN     NaN
         000002.sz    NaN     NaN
         000003.sz    NaN     NaN
         000004.sz    NaN     NaN
         000005.sz    NaN     NaN

How to handle this situation? Thanks.

Upvotes: 4

Views: 548

Answers (2)

jezrael
jezrael

Reputation: 863031

You need only parameter index_col in read_csv:

#by positions of columns
data = pd.read_csv('data/trial.csv', index_col=[0,1])

Or:

#by names of columns
data = pd.read_csv('data/trial.csv', index_col=['dt', 'st'])

print (data)
                    close  volume
dt       st                      
20100101 000001.sz      1   10000
         000002.sz     10   50000
         000003.sz      5    1000
         000004.sz     15    7000
         000005.sz    100  100000
20100102 000001.sz      2   20000
         000002.sz     20   60000
         000003.sz      6    2000
         000004.sz     20    8000
         000005.sz    110  110000

Why all element are all NaN when construct a multiIndex Dataframe?

Reason is in DataFrame constructor:

df = pd.DataFrame(data, idx, col)

DataFrame called data has RangeIndex and not align with new MultiIndex, so get NaNs in data.

Possible solution if always each dt has same st values is filter Dataframe by columns names and then convert to numpy array, but better are index_col and set_index solutions:

df = pd.DataFrame(data[col].values, idx, col)

Upvotes: 4

sacuL
sacuL

Reputation: 51395

Try using set_index() like this:

new_df = df.set_index(['dt', 'st'])

Result:

>>> new_df

                    close  volume
dt       st                      
20100101 000001.sz      1   10000
         000002.sz     10   50000
         000003.sz      5    1000
         000004.sz     15    7000
         000005.sz    100  100000
20100102 000001.sz      2   20000
         000002.sz     20   60000
         000003.sz      6    2000
         000004.sz     20    8000
         000005.sz    110  110000

>>> new_df.index
MultiIndex(levels=[[20100101, 20100102], ['000001.sz', '000002.sz', '000003.sz', '000004.sz', '000005.sz']],
           labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
           names=['dt', 'st'])

Upvotes: 3

Related Questions