Datacrawler
Datacrawler

Reputation: 2886

Convert column headers into a new column and keep the values of each column

UPD: A dataframe has been pasted as an example at the bottom of the page.

My original xls file looks like this:

enter image description here

and I need two actions in order to make it look like the below:

enter image description here

Firstly, I need to fill in the empty row values with the values shown in the cell above them. That has been achieved with the following function:

def get_csv():
    #Read csv file
    df = pd.read_excel('test.xls')
    df = df.fillna(method='ffill')
    return df

Secondly, I have used stack with set_index:

df = (df.set_index(['Country', 'Gender', 'Arr-Dep'])
         .stack()
         .reset_index(name='Value')
         .rename(columns={'level_3':'Year'}))

and I was wondering whether there is an easier way. Is there a library that transforms a dataframe, excel etc into the wanted format?

Original dataframe after excel import:

Country  Gender                          Direction   1974   1975   1976  
0   Austria    Male  IN  13728   8754   9695   
1       NaN     NaN  OUT  17977  12271   9899   
2       NaN  Female  IN   8541   6465   6447   
3       NaN     NaN  OUT   8450   7190   6288   
4       NaN   Total  IN  22269  15219  16142   
5       NaN     NaN  OUT  26427  19461  16187   
6   Belgium    Male  IN   2412   2245   2296   
7       NaN     NaN  OUT   2800   2490   2413   
8       NaN  Female  IN   2105   2022   2057   
9       NaN     NaN  OUT   2100   2113   2004   
10      NaN   Total  IN   4517   4267   4353   
11      NaN     NaN  OUT   4900   4603   4417

Upvotes: 1

Views: 61

Answers (2)

piRSquared
piRSquared

Reputation: 294576

stack

I like your approach. I'd change it in a couple of ways.

  1. use the specific method for foward filling ffill
  2. rename the column axis prior to stacking to avoid the renaming of the column later (personal preference)

df.ffill().set_index(
    ['Country', 'Gender', 'Direction']
).rename_axis('Year', 1).stack().reset_index(name='Value')

    Country  Gender Direction  Year  Value
0   Austria    Male        IN  1974  13728
1   Austria    Male        IN  1975   8754
2   Austria    Male        IN  1976   9695
3   Austria    Male       OUT  1974  17977
4   Austria    Male       OUT  1975  12271
5   Austria    Male       OUT  1976   9899
...

Numpy

I wanted to put together a custom approach. This should be very fast.

def cstm_ffill(s):
  i = np.flatnonzero(s.notna())
  i = np.concatenate([[0], i, [len(s)]])
  d = np.diff(i)
  a = s.values[i[:-1].repeat(d)]
  return a

def cstm_melt(df):
  c = cstm_ffill(df.Country)
  g = cstm_ffill(df.Gender)
  d = cstm_ffill(df.Direction)
  y = df.columns[3:].values

  k = len(y)

  i = np.column_stack([c, g, d])
  v = np.column_stack([*map(df.get, y)]).ravel()

  df_ = pd.DataFrame(
      np.column_stack([i.repeat(k, axis=0), v, np.tile(y, len(i))]),
      columns=['Country', 'Gender', 'Direction', 'Year', 'Value']
  )
  return df_

cstm_melt(df)

    Country  Gender Direction   Year Value
0   Austria    Male        IN  13728  1974
1   Austria    Male        IN   8754  1975
2   Austria    Male        IN   9695  1976
3   Austria    Male       OUT  17977  1974
4   Austria    Male       OUT  12271  1975
5   Austria    Male       OUT   9899  1976
...

Upvotes: 2

jezrael
jezrael

Reputation: 863791

Alternative solution is use melt, but if need same ordering of columns like stacked DataFrame is necessary add sort_values:

df1 = (df.ffill()
         .melt(id_vars=['Country','Gender','Direction'],var_name="Date",value_name='Value')
)

print (df1)
    Country  Gender Direction  Date  Value
0   Austria    Male        IN  1974  13728
1   Austria    Male       OUT  1974  17977
2   Austria  Female        IN  1974   8541
3   Austria  Female       OUT  1974   8450
4   Austria   Total        IN  1974  22269
5   Austria   Total       OUT  1974  26427
6   Belgium    Male        IN  1974   2412
7   Belgium    Male       OUT  1974   2800
8   Belgium  Female        IN  1974   2105
9   Belgium  Female       OUT  1974   2100
10  Belgium   Total        IN  1974   4517
11  Belgium   Total       OUT  1974   4900
12  Austria    Male        IN  1975   8754
13  Austria    Male       OUT  1975  12271
14  Austria  Female        IN  1975   6465
15  Austria  Female       OUT  1975   7190
16  Austria   Total        IN  1975  15219
17  Austria   Total       OUT  1975  19461
18  Belgium    Male        IN  1975   2245
19  Belgium    Male       OUT  1975   2490
20  Belgium  Female        IN  1975   2022
21  Belgium  Female       OUT  1975   2113
22  Belgium   Total        IN  1975   4267
23  Belgium   Total       OUT  1975   4603
24  Austria    Male        IN  1976   9695
25  Austria    Male       OUT  1976   9899
26  Austria  Female        IN  1976   6447
27  Austria  Female       OUT  1976   6288
28  Austria   Total        IN  1976  16142
29  Austria   Total       OUT  1976  16187
30  Belgium    Male        IN  1976   2296
...
...

df1 = (df.ffill()
         .melt(id_vars=['Country','Gender','Direction'],var_name="Date", value_name='Value')
         .sort_values(['Country', 'Gender','Direction'])
         .reset_index(drop=True))

print (df1)
    Country  Gender Direction  Date  Value
0   Austria  Female        IN  1974   8541
1   Austria  Female        IN  1975   6465
2   Austria  Female        IN  1976   6447
3   Austria  Female       OUT  1974   8450
4   Austria  Female       OUT  1975   7190
5   Austria  Female       OUT  1976   6288
6   Austria    Male        IN  1974  13728
7   Austria    Male        IN  1975   8754
8   Austria    Male        IN  1976   9695
9   Austria    Male       OUT  1974  17977
10  Austria    Male       OUT  1975  12271
11  Austria    Male       OUT  1976   9899
12  Austria   Total        IN  1974  22269
13  Austria   Total        IN  1975  15219
14  Austria   Total        IN  1976  16142
15  Austria   Total       OUT  1974  26427
16  Austria   Total       OUT  1975  19461
17  Austria   Total       OUT  1976  16187
18  Belgium  Female        IN  1974   2105
19  Belgium  Female        IN  1975   2022
20  Belgium  Female        IN  1976   2057
21  Belgium  Female       OUT  1974   2100
22  Belgium  Female       OUT  1975   2113
23  Belgium  Female       OUT  1976   2004
24  Belgium    Male        IN  1974   2412
25  Belgium    Male        IN  1975   2245
26  Belgium    Male        IN  1976   2296
27  Belgium    Male       OUT  1974   2800
28  Belgium    Male       OUT  1975   2490
29  Belgium    Male       OUT  1976   2413
30  Belgium   Total        IN  1974   4517
...
...

Upvotes: 2

Related Questions