Reputation: 4545
I have a csv file in the following format:
| a | b | 2018 | 2018 | 2019 | 2019 |
| | | jan | feb | jan | feb |
---------------------------------------
| a1 | b1 | 0 | 1 | 2 | 3 |
| a1 | b2 | 4 | 5 | 6 | 7 |
| a2 | b1 | 8 | 9 | 10 | 11 |
| a2 | b2 | 12 | 13 | 14 | 15 |
I would like to read it into a pandas DF, and then melt it to the following format:
| a | b | year | month | value |
----------------------------------
| a1 | b1 | 2018 | jan | 0 |
| a1 | b1 | 2018 | feb | 1 |
| a1 | b1 | 2019 | jan | 2 |
| a1 | b1 | 2019 | feb | 3 |
| a1 | b2 | 2018 | jan | 4 |
| a1 | b2 | 2018 | feb | 5 |
| a1 | b2 | 2019 | jan | 6 |
| a1 | b2 | 2019 | feb | 7 |
| a2 | b1 | 2018 | jan | 8 |
| a2 | b1 | 2018 | feb | 9 |
| a2 | b1 | 2019 | jan | 10 |
| a2 | b1 | 2019 | feb | 11 |
| a2 | b2 | 2018 | jan | 12 |
| a2 | b2 | 2018 | feb | 13 |
| a2 | b2 | 2019 | jan | 14 |
| a2 | b2 | 2019 | feb | 15 |
How can this be achieved?
Upvotes: 3
Views: 6193
Reputation: 166
@KOB my answer can fit generally any csv file with 2 rows header, where part of columns are only on first row, and part on both first and second row. Based on your question this code will put correctly all headers as requested. while reading csv and a MulitIndex dataframe created:
df_multiidx = pd.read_csv('two_levels_header_file.csv', header=[0,1])
id_vars = [idv for idv in df_multiidx.columns if 'Unnamed' in idv[1]]
value_vars = [valv for valv in df_multiidx.columns if 'Unnamed' not in valv[1]]
df_multiidx= df_multiidx.melt(id_vars=id_vars, value_vars=value_vars,var_name=['year','month'])
df_multiidx.rename(columns={col_ren:col_ren[0] for col_ren in id_vars})
Output:
a b year month value
0 a1 b1 2018 jan 0
1 a1 b2 2018 jan 4
2 a2 b1 2018 jan 8
3 a2 b2 2018 jan 12
4 a1 b1 2018 feb 1
5 a1 b2 2018 feb 5
6 a2 b1 2018 feb 9
7 a2 b2 2018 feb 13
8 a1 b1 2019 jan 2
9 a1 b2 2019 jan 6
10 a2 b1 2019 jan 10
11 a2 b2 2019 jan 14
12 a1 b1 2019 feb 3
13 a1 b2 2019 feb 7
14 a2 b1 2019 feb 11
15 a2 b2 2019 feb 15
Upvotes: 2
Reputation: 1430
In case of plain dataframe, this should work:
import pandas as pd
df = pd.DataFrame({
'a': ['a1', 'a1', 'a2', 'a2',],
'b': ['b1', 'b2', 'b2', 'b2',],
'2018 jan': [0, 4, 8, 12],
'2018 feb': [1, 5, 9, 13],
'2019 jan': [2, 6, 10, 14],
'2019 feb': [3, 7, 11, 15],
})
df = df.melt(id_vars=['a', 'b'], var_name='date', value_name='value')
df['date'] = df['date'].str.split(' ')
df['year'] = df['date'].str[0]
df['month'] = df['date'].str[1]
df.drop(columns='date', inplace=True)
Output:
a b value year month
0 a1 b1 0 2018 jan
1 a1 b2 4 2018 jan
2 a2 b2 8 2018 jan
3 a2 b2 12 2018 jan
4 a1 b1 1 2018 feb
5 a1 b2 5 2018 feb
6 a2 b2 9 2018 feb
7 a2 b2 13 2018 feb
8 a1 b1 2 2019 jan
9 a1 b2 6 2019 jan
10 a2 b2 10 2019 jan
11 a2 b2 14 2019 jan
12 a1 b1 3 2019 feb
13 a1 b2 7 2019 feb
14 a2 b2 11 2019 feb
15 a2 b2 15 2019 feb
If you have some multi-index in columns as mentioned in comment, here you can convert it to plain dataframe:
df = pd.read_csv('file.csv', header=[0,1])
df.columns = [' '.join(col).strip() for col in df.columns.values]
df.rename(columns={'a Unnamed: 0_level_1': 'a', 'b Unnamed: 1_level_1': 'b'}, inplace=True)
Upvotes: 4