Reputation: 2886
UPD: A dataframe has been pasted as an example at the bottom of the page.
My original xls file looks like this:
and I need two actions in order to make it look like the below:
Firstly, I need to fill in the empty row values with the values shown in the cell above them. That has been achieved with the following function:
def get_csv():
#Read csv file
df = pd.read_excel('test.xls')
df = df.fillna(method='ffill')
return df
Secondly, I have used stack
with set_index
:
df = (df.set_index(['Country', 'Gender', 'Arr-Dep'])
.stack()
.reset_index(name='Value')
.rename(columns={'level_3':'Year'}))
and I was wondering whether there is an easier way. Is there a library that transforms a dataframe, excel etc into the wanted format?
Original dataframe after excel import:
Country Gender Direction 1974 1975 1976
0 Austria Male IN 13728 8754 9695
1 NaN NaN OUT 17977 12271 9899
2 NaN Female IN 8541 6465 6447
3 NaN NaN OUT 8450 7190 6288
4 NaN Total IN 22269 15219 16142
5 NaN NaN OUT 26427 19461 16187
6 Belgium Male IN 2412 2245 2296
7 NaN NaN OUT 2800 2490 2413
8 NaN Female IN 2105 2022 2057
9 NaN NaN OUT 2100 2113 2004
10 NaN Total IN 4517 4267 4353
11 NaN NaN OUT 4900 4603 4417
Upvotes: 1
Views: 61
Reputation: 294576
stack
I like your approach. I'd change it in a couple of ways.
ffill
df.ffill().set_index(
['Country', 'Gender', 'Direction']
).rename_axis('Year', 1).stack().reset_index(name='Value')
Country Gender Direction Year Value
0 Austria Male IN 1974 13728
1 Austria Male IN 1975 8754
2 Austria Male IN 1976 9695
3 Austria Male OUT 1974 17977
4 Austria Male OUT 1975 12271
5 Austria Male OUT 1976 9899
...
I wanted to put together a custom approach. This should be very fast.
def cstm_ffill(s):
i = np.flatnonzero(s.notna())
i = np.concatenate([[0], i, [len(s)]])
d = np.diff(i)
a = s.values[i[:-1].repeat(d)]
return a
def cstm_melt(df):
c = cstm_ffill(df.Country)
g = cstm_ffill(df.Gender)
d = cstm_ffill(df.Direction)
y = df.columns[3:].values
k = len(y)
i = np.column_stack([c, g, d])
v = np.column_stack([*map(df.get, y)]).ravel()
df_ = pd.DataFrame(
np.column_stack([i.repeat(k, axis=0), v, np.tile(y, len(i))]),
columns=['Country', 'Gender', 'Direction', 'Year', 'Value']
)
return df_
cstm_melt(df)
Country Gender Direction Year Value
0 Austria Male IN 13728 1974
1 Austria Male IN 8754 1975
2 Austria Male IN 9695 1976
3 Austria Male OUT 17977 1974
4 Austria Male OUT 12271 1975
5 Austria Male OUT 9899 1976
...
Upvotes: 2
Reputation: 863791
Alternative solution is use melt
, but if need same ordering of columns like stacked
DataFrame
is necessary add sort_values
:
df1 = (df.ffill()
.melt(id_vars=['Country','Gender','Direction'],var_name="Date",value_name='Value')
)
print (df1)
Country Gender Direction Date Value
0 Austria Male IN 1974 13728
1 Austria Male OUT 1974 17977
2 Austria Female IN 1974 8541
3 Austria Female OUT 1974 8450
4 Austria Total IN 1974 22269
5 Austria Total OUT 1974 26427
6 Belgium Male IN 1974 2412
7 Belgium Male OUT 1974 2800
8 Belgium Female IN 1974 2105
9 Belgium Female OUT 1974 2100
10 Belgium Total IN 1974 4517
11 Belgium Total OUT 1974 4900
12 Austria Male IN 1975 8754
13 Austria Male OUT 1975 12271
14 Austria Female IN 1975 6465
15 Austria Female OUT 1975 7190
16 Austria Total IN 1975 15219
17 Austria Total OUT 1975 19461
18 Belgium Male IN 1975 2245
19 Belgium Male OUT 1975 2490
20 Belgium Female IN 1975 2022
21 Belgium Female OUT 1975 2113
22 Belgium Total IN 1975 4267
23 Belgium Total OUT 1975 4603
24 Austria Male IN 1976 9695
25 Austria Male OUT 1976 9899
26 Austria Female IN 1976 6447
27 Austria Female OUT 1976 6288
28 Austria Total IN 1976 16142
29 Austria Total OUT 1976 16187
30 Belgium Male IN 1976 2296
...
...
df1 = (df.ffill()
.melt(id_vars=['Country','Gender','Direction'],var_name="Date", value_name='Value')
.sort_values(['Country', 'Gender','Direction'])
.reset_index(drop=True))
print (df1)
Country Gender Direction Date Value
0 Austria Female IN 1974 8541
1 Austria Female IN 1975 6465
2 Austria Female IN 1976 6447
3 Austria Female OUT 1974 8450
4 Austria Female OUT 1975 7190
5 Austria Female OUT 1976 6288
6 Austria Male IN 1974 13728
7 Austria Male IN 1975 8754
8 Austria Male IN 1976 9695
9 Austria Male OUT 1974 17977
10 Austria Male OUT 1975 12271
11 Austria Male OUT 1976 9899
12 Austria Total IN 1974 22269
13 Austria Total IN 1975 15219
14 Austria Total IN 1976 16142
15 Austria Total OUT 1974 26427
16 Austria Total OUT 1975 19461
17 Austria Total OUT 1976 16187
18 Belgium Female IN 1974 2105
19 Belgium Female IN 1975 2022
20 Belgium Female IN 1976 2057
21 Belgium Female OUT 1974 2100
22 Belgium Female OUT 1975 2113
23 Belgium Female OUT 1976 2004
24 Belgium Male IN 1974 2412
25 Belgium Male IN 1975 2245
26 Belgium Male IN 1976 2296
27 Belgium Male OUT 1974 2800
28 Belgium Male OUT 1975 2490
29 Belgium Male OUT 1976 2413
30 Belgium Total IN 1974 4517
...
...
Upvotes: 2