Dibyendu Dey
Dibyendu Dey

Reputation: 369

split and group in panda dataframe

Suppose I have Panda DataFrame which looks following:

Input

Name    Key Val
David   A   1
Roe     B   2
John    A   3
Nat     B   4

I want to split by Key and Group by Name.

Output

Name    A   B
David   1   nan
John    3   nan
Nat     nan  4
Roe     nan  2

Can you please suggest a way to do it?

Below is code to generate the dataframe.

import pandas as pd
# Initializing the nested list with Data-set
data = [['David','A',1],
        ['Roe','B',2],
        ['John','A',3],
        ['Nat','B',4]]
df = pd.DataFrame(data, columns=['Name', 'Key','Val'])

Upvotes: 2

Views: 66

Answers (3)

Dibyendu Dey
Dibyendu Dey

Reputation: 369

import pandas as pd
data = {"Id": ["DA0445EA", "DA0445EA", "DA0445EA", "DA0445EA", "DA0445EA"],
        "Port" : [1,1,1,1,1],
        "Lane" :[None,None,None,None,None],
        "Key" : ["start", "start","case","case_start","end"],
        "Val":[0.1,0.1,0.3,0.4,0.5] 
        }
df = pd.DataFrame(data,columns=['Id', 'Port','Lane','Key','Val'])
df1 = df.set_index(['Id', 'Port','Lane','Key'])['Val'].unstack()
df1 = df1.reset_index()

Below is Error generated. The reason is "There is two duplicate entry for same id". Such entry is unavoidable for datatable I'm working on. I was expecting, It will overwrite same line in table and won't throw any error like below.

raise ValueError('Index contains duplicate entries, '
ValueError: Index contains duplicate entries, cannot reshape

Is there a way to avoid such error during Re-shaping?

Upvotes: 0

Dibyendu Dey
Dibyendu Dey

Reputation: 369

Example input data:

data = {"Id": ["DA0445EA", "DA0445EA", "DA0445EA", "DA0445EA", "DA0445EA"],
        "Port" : [1,1,1,1,1],
        "Lane" :[None,None,None,None,None],
        "Key" : ["start_now", "start","case","case_start","end"],
        #"val": [0.000001,0.2,0.3,0.4,0.5]
        "Val":[0.000001,0.2,0.3,0.4,0.5] #need to capitalize this 'val' to match dataframe definition below
        }
df = pd.DataFrame(data,columns=['Id', 'Port','Lane','Key','Val'])
df1 = df.set_index(['Id', 'Port','Lane','Key'])['Val'].unstack()
df1 = df1.reset_index()

Exapected output:

Id Port Lane start_now start case case_start end
DA0445EA 1 nan 0.000001 0.2 0.3 0.4 0.5

Issue I'm getting:

  1. all floating number are shown as NAN
  2. individual columns (Id, Port, Lane) aren't shown as columns

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153460

Use reshaping like this, set_index with Name and Key creating a multiIndex, then unstack the inner most index level to create columns:

df.set_index(['Name','Key'])['Val'].unstack()

Output:

Key      A    B
Name           
David  1.0  NaN
John   3.0  NaN
Nat    NaN  4.0
Roe    NaN  2.0

Upvotes: 1

Related Questions