Reputation: 65
I have the below csv file
And after reading it with python in a DataFrame,I want to reshape it as
country,year1,value
country,year2,value
Taking into account only the country the year and the value.. So I will end up with 3 variables.
You can find the data set here:
http://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG
Upvotes: 2
Views: 110
Reputation: 862661
You can use:
read_csv
with omit first 4 rows by skiprows
isnull
with all
iloc
for select first 61
columnsset_index
with unstack
reset_index
with rename
column by dict
df = pd.read_csv('API_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2.csv', skiprows=4)
print (df.head())
Country Name Country Code Indicator Name Indicator Code 1960 \
0 Aruba ABW GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN
1 Afghanistan AFG GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN
2 Angola AGO GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN
3 Albania ALB GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN
4 Andorra AND GDP growth (annual %) NY.GDP.MKTP.KD.ZG NaN
1961 1962 1963 1964 1965 ... 2008 2009 2010 \
0 NaN NaN NaN NaN NaN ... -6.881302 -5.653502 NaN
1 NaN NaN NaN NaN NaN ... 3.611368 21.020649 8.433290
2 NaN NaN NaN NaN NaN ... 13.817146 2.412870 3.407655
3 NaN NaN NaN NaN NaN ... 7.530000 3.350000 3.710000
4 NaN NaN NaN NaN NaN ... -8.594256 -3.817986 -5.347977
2011 2012 2013 2014 2015 2016 Unnamed: 61
0 NaN NaN NaN NaN NaN NaN NaN
1 6.113685 14.434741 1.959123 1.312531 1.112558 2.232272 NaN
2 3.918597 5.155441 6.813586 4.804473 3.006981 0.000000 NaN
3 2.550000 1.420000 1.110000 1.800000 2.590000 3.460000 NaN
4 -4.802675 -1.760010 -0.063514 NaN NaN NaN NaN
cols = ['Country Name','Country Code','Indicator Name','Indicator Code']
print (df.iloc[:, 61].isnull().all())
True
d = {'level_4':'year'}
df = df.iloc[:, :60]
.set_index(cols)
.stack()
.reset_index(name='vals')
.rename(columns=d)
print (df.head())
Country Name Country Code Indicator Name Indicator Code year \
0 Aruba ABW GDP growth (annual %) NY.GDP.MKTP.KD.ZG 1995
1 Aruba ABW GDP growth (annual %) NY.GDP.MKTP.KD.ZG 1996
2 Aruba ABW GDP growth (annual %) NY.GDP.MKTP.KD.ZG 1997
3 Aruba ABW GDP growth (annual %) NY.GDP.MKTP.KD.ZG 1998
4 Aruba ABW GDP growth (annual %) NY.GDP.MKTP.KD.ZG 1999
vals
0 1.245086
1 7.814432
2 6.666622
3 1.154469
4 4.514062
For 3 columns is solution very similar, for remove columns use drop
:
df = pd.read_csv('API_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2.csv', skiprows=4)
d = {'level_1':'year'}
df= df.drop(['Country Code','Indicator Name','Indicator Code', 'Unnamed: 61'], axis=1)
df = df.set_index('Country Name').stack().reset_index(name='vals').rename(columns=d)
print (df.head())
Country Name year vals
0 Aruba 1995 1.245086
1 Aruba 1996 7.814432
2 Aruba 1997 6.666622
3 Aruba 1998 1.154469
4 Aruba 1999 4.514062
Upvotes: 3
Reputation: 2968
You can try this:
df = pd.read_csv('Your file path.csv', dtype = object) # read file
df.drop(['Country Name','Indicator Name'], axis = 1, inplace = True) # delete unwanted columns
dfT = pd.melt(df, id_vars=['Country Code','Indicator Code']) # transpose
dfF = dfT.dropna(subset=['value']) # delete Nan
Upvotes: 2