Codutie
Codutie

Reputation: 1075

data manipulation example from wide to long in python

I've just placed a similar question here and got an answer but recognised, that by adding a new column to a DataFrame the presented solution fails as the problem is a bit different.

I want to go from here:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2],
                   'Value_2013': [100, 200],
                   'Value_2014': [245, 300],
                   'Value_2016': [200, float('NaN')]})

print(df)

    ID  Value_2013  Value_2014  Value_2016
0   1         100         245       200.0
1   2         200         300         NaN

to:

df_new = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                       'Year': [2013, 2014, 2016, 2013, 2014],
                       'Value': [100, 245, 200, 200, 300]})

print(df_new)

    ID  Value  Year
0   1    100  2013
1   1    245  2014
2   1    200  2016
3   2    200  2013
4   2    300  2014

Any ideas how I can face this challenge?

Upvotes: 4

Views: 5034

Answers (6)

sammywemmy
sammywemmy

Reputation: 28709

One option is with the pivot_longer function from pyjanitor, using the .value placeholder:

# pip install pyjanitor
import pandas as pd
import janitor

(df
.pivot_longer(
   index = "ID", 
   names_to=(".value", "Year"), 
   names_sep="_", 
   sort_by_appearance=True)
.dropna()
)

   ID  Year  Value
0   1  2013  100.0
1   1  2014  245.0
2   1  2016  200.0
3   2  2013  200.0
4   2  2014  300.0

The .value keeps the part of the column associated with it as header, while the rest goes into the Year column.

Upvotes: 0

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210882

Yet another soution (two steps):

In [31]: x = df.set_index('ID').stack().astype(int).reset_index(name='Value')

In [32]: x
Out[32]:
   ID     level_1  Value
0   1  Value_2013    100
1   1  Value_2014    245
2   1  Value_2016    200
3   2  Value_2013    200
4   2  Value_2014    300

In [33]: x = x.assign(Year=x.pop('level_1').str.extract(r'(\d{4})', expand=False))

In [34]: x
Out[34]:
   ID  Value  Year
0   1    100  2013
1   1    245  2014
2   1    200  2016
3   2    200  2013
4   2    300  2014

Upvotes: 0

Martin Valgur
Martin Valgur

Reputation: 6322

The pandas.melt() method gets you halfway there. After that it's just some minor cleaning up.

df = pd.melt(df, id_vars='ID', var_name='Year', value_name='Value')
df['Year'] = df['Year'].map(lambda x: x.split('_')[1])
df = df.dropna().astype(int).sort_values(['ID', 'Year']).reset_index(drop=True)
df = df.reindex_axis(['ID', 'Value', 'Year'], axis=1)

print(df)
   ID  Value  Year
0   1    100  2013
1   1    245  2014
2   1    200  2016
3   2    200  2013
4   2    300  2014

Upvotes: 4

Alexander
Alexander

Reputation: 1345

Another option is pd.wide_to_long(). Admittedly it doesn't give you exactly the same output but you can clean up as needed.

pd.wide_to_long(df, ['Value_',], i='', j='Year')

          ID  Value_
    Year            
NaN 2013   1     100
    2013   2     200
    2014   1     245
    2014   2     300
    2016   1     200
    2016   2     NaN

Upvotes: 1

Shivam Gaur
Shivam Gaur

Reputation: 1062

Leveraging Multi Indexing in Pandas

import numpy as np
import pandas as pd
from collections import OrderedDict

df = pd.DataFrame({'ID': [1, 2],
                   'Value_2013': [100, 200],
                   'Value_2014': [245, 300],
                   'Value_2016': [200, float('NaN')]})


# Set ID column as Index
df = df.set_index('ID')

# unstack all columns, swap the levels in the row index 
# and convert series to df
df = df.unstack().swaplevel().to_frame().reset_index()
# Rename columns as desired
df.columns = ['ID', 'Year', 'Value']

# Transform the year values from Value_2013 --> 2013 and so on
df['Year'] = df['Year'].apply(lambda x : x.split('_')[1]).astype(np.int)

# Sort by ID
df = df.sort_values(by='ID').reset_index(drop=True).dropna()

print(df)
   ID  Year  Value
0   1  2013  100.0
1   1  2014  245.0
2   1  2016  200.0
3   2  2013  200.0
4   2  2014  300.0

Upvotes: 1

jezrael
jezrael

Reputation: 863166

You need add set_index first:

df = df.set_index('ID')
df.columns = df.columns.str.split('_', expand=True)
df = df.stack().rename_axis(['ID','Year']).reset_index()
df.Value = df.Value.astype(int)
#if order of columns is important
df = df.reindex_axis(['ID','Value','Year'], axis=1)
print (df)
   ID  Value  Year
0   1    100  2013
1   1    245  2014
2   1    200  2016
3   2    200  2013
4   2    300  2014

Upvotes: 3

Related Questions