cgxanth
cgxanth

Reputation: 65

Pivot a dataframe in python

I have the below csv file

enter image description here

And after reading it with python in a DataFrame,I want to reshape it as
country,year1,value
country,year2,value

Taking into account only the country the year and the value.. So I will end up with 3 variables.

You can find the data set here:
http://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG

Upvotes: 2

Views: 110

Answers (2)

jezrael
jezrael

Reputation: 862661

You can use:

  • read_csv with omit first 4 rows by skiprows
  • check if last column has all NaN values by isnull with all
  • all values are NaNs, so you can omit it, one possible solution is use iloc for select first 61 columns
  • reshape by set_index with unstack
  • last reset_index with rename column by dict

df = pd.read_csv('API_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2.csv', skiprows=4)
print (df.head())
  Country Name Country Code         Indicator Name     Indicator Code  1960  \
0        Aruba          ABW  GDP growth (annual %)  NY.GDP.MKTP.KD.ZG   NaN   
1  Afghanistan          AFG  GDP growth (annual %)  NY.GDP.MKTP.KD.ZG   NaN   
2       Angola          AGO  GDP growth (annual %)  NY.GDP.MKTP.KD.ZG   NaN   
3      Albania          ALB  GDP growth (annual %)  NY.GDP.MKTP.KD.ZG   NaN   
4      Andorra          AND  GDP growth (annual %)  NY.GDP.MKTP.KD.ZG   NaN   

   1961  1962  1963  1964  1965     ...            2008       2009      2010  \
0   NaN   NaN   NaN   NaN   NaN     ...       -6.881302  -5.653502       NaN   
1   NaN   NaN   NaN   NaN   NaN     ...        3.611368  21.020649  8.433290   
2   NaN   NaN   NaN   NaN   NaN     ...       13.817146   2.412870  3.407655   
3   NaN   NaN   NaN   NaN   NaN     ...        7.530000   3.350000  3.710000   
4   NaN   NaN   NaN   NaN   NaN     ...       -8.594256  -3.817986 -5.347977   

       2011       2012      2013      2014      2015      2016  Unnamed: 61  
0       NaN        NaN       NaN       NaN       NaN       NaN          NaN  
1  6.113685  14.434741  1.959123  1.312531  1.112558  2.232272          NaN  
2  3.918597   5.155441  6.813586  4.804473  3.006981  0.000000          NaN  
3  2.550000   1.420000  1.110000  1.800000  2.590000  3.460000          NaN  
4 -4.802675  -1.760010 -0.063514       NaN       NaN       NaN          NaN

cols = ['Country Name','Country Code','Indicator Name','Indicator Code']

print (df.iloc[:, 61].isnull().all())
True    

d = {'level_4':'year'}
df = df.iloc[:, :60]
       .set_index(cols)
       .stack()
       .reset_index(name='vals')
       .rename(columns=d)
print (df.head())
  Country Name Country Code         Indicator Name     Indicator Code  year  \
0        Aruba          ABW  GDP growth (annual %)  NY.GDP.MKTP.KD.ZG  1995   
1        Aruba          ABW  GDP growth (annual %)  NY.GDP.MKTP.KD.ZG  1996   
2        Aruba          ABW  GDP growth (annual %)  NY.GDP.MKTP.KD.ZG  1997   
3        Aruba          ABW  GDP growth (annual %)  NY.GDP.MKTP.KD.ZG  1998   
4        Aruba          ABW  GDP growth (annual %)  NY.GDP.MKTP.KD.ZG  1999   

       vals  
0  1.245086  
1  7.814432  
2  6.666622  
3  1.154469  
4  4.514062

For 3 columns is solution very similar, for remove columns use drop:

df = pd.read_csv('API_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2.csv', skiprows=4)

d = {'level_1':'year'}
df= df.drop(['Country Code','Indicator Name','Indicator Code', 'Unnamed: 61'], axis=1)
df = df.set_index('Country Name').stack().reset_index(name='vals').rename(columns=d)
print (df.head())
  Country Name  year      vals
0        Aruba  1995  1.245086
1        Aruba  1996  7.814432
2        Aruba  1997  6.666622
3        Aruba  1998  1.154469
4        Aruba  1999  4.514062

Upvotes: 3

question.it
question.it

Reputation: 2968

You can try this:

df = pd.read_csv('Your file path.csv', dtype = object) # read file
df.drop(['Country Name','Indicator Name'], axis = 1, inplace = True) # delete unwanted columns
dfT = pd.melt(df, id_vars=['Country Code','Indicator Code']) # transpose
dfF = dfT.dropna(subset=['value']) # delete Nan

Upvotes: 2

Related Questions