Reputation: 549
suppose I have a Dataframe like this. I want to convert this to a 2-level multiIndex Dataframe.
dt st close volume
0 20100101 000001.sz 1 10000
1 20100101 000002.sz 10 50000
2 20100101 000003.sz 5 1000
3 20100101 000004.sz 15 7000
4 20100101 000005.sz 100 100000
5 20100102 000001.sz 2 20000
6 20100102 000002.sz 20 60000
7 20100102 000003.sz 6 2000
8 20100102 000004.sz 20 8000
9 20100102 000005.sz 110 110000
But when I try this code:
data = pd.read_csv('data/trial.csv')
print(data)
idx = pd.MultiIndex.from_product([data.dt.unique(),
data.st.unique()],
names=['dt', 'st'])
col = ['close', 'volume']
df = pd.DataFrame(data, idx, col)
print(df)
I find that all the element are NaN
close volume
dt st
20100101 000001.sz NaN NaN
000002.sz NaN NaN
000003.sz NaN NaN
000004.sz NaN NaN
000005.sz NaN NaN
20100102 000001.sz NaN NaN
000002.sz NaN NaN
000003.sz NaN NaN
000004.sz NaN NaN
000005.sz NaN NaN
How to handle this situation? Thanks.
Upvotes: 4
Views: 548
Reputation: 863031
You need only parameter index_col
in read_csv
:
#by positions of columns
data = pd.read_csv('data/trial.csv', index_col=[0,1])
Or:
#by names of columns
data = pd.read_csv('data/trial.csv', index_col=['dt', 'st'])
print (data)
close volume
dt st
20100101 000001.sz 1 10000
000002.sz 10 50000
000003.sz 5 1000
000004.sz 15 7000
000005.sz 100 100000
20100102 000001.sz 2 20000
000002.sz 20 60000
000003.sz 6 2000
000004.sz 20 8000
000005.sz 110 110000
Why all element are all NaN when construct a multiIndex Dataframe?
Reason is in DataFrame
constructor:
df = pd.DataFrame(data, idx, col)
DataFrame
called data
has RangeIndex
and not align with new MultiIndex
, so get NaN
s in data.
Possible solution if always each dt
has same st
values is filter Dataframe by columns names and then convert to numpy array
, but better are index_col
and set_index
solutions:
df = pd.DataFrame(data[col].values, idx, col)
Upvotes: 4
Reputation: 51395
Try using set_index()
like this:
new_df = df.set_index(['dt', 'st'])
Result:
>>> new_df
close volume
dt st
20100101 000001.sz 1 10000
000002.sz 10 50000
000003.sz 5 1000
000004.sz 15 7000
000005.sz 100 100000
20100102 000001.sz 2 20000
000002.sz 20 60000
000003.sz 6 2000
000004.sz 20 8000
000005.sz 110 110000
>>> new_df.index
MultiIndex(levels=[[20100101, 20100102], ['000001.sz', '000002.sz', '000003.sz', '000004.sz', '000005.sz']],
labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
names=['dt', 'st'])
Upvotes: 3