Reputation: 8009
I have the following data:
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/clean_gdp_data_all.csv'
c=pd.read_csv(url, index_col=0)
c = c.loc[(c.GeoName == 'California') & \
(c.ComponentName == 'Real GDP by state')]
c.head(3)
GeoName ComponentName IndustryClassification Description 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
38281 California Real GDP by state 111-112 Farms 15717 18751 18215 15335 14109 18798 19197 16535 15014 16909 0
38282 California Real GDP by state 113-115 Forestry, fishing, and related activities 6234 6278 7845 7786 7365 7390 7831 8115 8995 9312 0
38284 California Real GDP by state 211 Oil and gas extraction 7769 8107 10693 12342 12010 17155 14575 15289 18849 16165 0
I want to run the following code with a for loop, except that I want to run it for every year (2004-2014) and then merge them all together, as indicated in the last line of code:
d = c.sort_values('2004', ascending=False).head(10)[['GeoName', \
'IndustryClassification', 'Description', 'ComponentName', '2004' ]]
e = c.sort_values('2005', ascending=False).head(10)[['GeoName', \
'IndustryClassification', 'Description', 'ComponentName', '2005' ]]
crgdp = pd.merge(d,e, how='inner', on=['GeoName', \
'IndustryClassification', 'Description', 'ComponentName'])
Upvotes: 1
Views: 720
Reputation: 210842
I think you can't do this in the way you want to do it because all values in one row are "connected" and belong to that row. So you can sort your DF by one column, which will reorder all rows with all corresponding values, but next time when you will sort another column - you'll lose sorted order in the first column and so on...
Look at index values and at values in a
and b
columns in the following example:
In [16]: df
Out[16]:
a b c
0 0 7 1
1 6 6 0
2 7 4 5
In [17]: df.sort_values(by='a', ascending=False)
Out[17]:
a b c
2 7 4 5
1 6 6 0
0 0 7 1
In [18]: df.sort_values(by='b', ascending=False)
Out[18]:
a b c
0 0 7 1
1 6 6 0
2 7 4 5
In [19]: df.sort_values(by=['a','b'], ascending=False)
Out[19]:
a b c
2 7 4 5
1 6 6 0
0 0 7 1
NOTE: doesn't matter how do we sort our data all values in each row are "bound" to each other and to their index.
So you can either sort your DF by a
or by b
or by ['a','b']
, but in this case your b
column will not be monotonically decreasing.
And looking at your data - if you would group your data by your "merging" columns and check for duplicates you would see that you don't have any of them:
In [132]: c.groupby(['GeoName', 'IndustryClassification', 'Description', 'ComponentName']).size().nlargest(3)
Out[132]:
GeoName IndustryClassification Description ComponentName
California ... Federal civilian Real GDP by state 1
Federal military Real GDP by state 1
State and local Real GDP by state 1
dtype: int64
It shows that each group has exactly one row. So after merging all your rows will remain the same, because you can consider ['GeoName', 'IndustryClassification', 'Description', 'ComponentName']
columns as a primary key (i.e. unique identifier).
Here is an example:
In [125]: c.query("GeoName == 'California' and IndustryClassification == '111-112' and Description == 'Farms' and ComponentName == 'Real GDP by s
tate'")
Out[125]:
GeoName ComponentName IndustryClassification Description \
38281 California Real GDP by state 111-112 Farms
2004 2005 2006 2007 2008 2009 2010 2011 \
38281 15717.0 18751.0 18215.0 15335.0 14109.0 18798.0 19197.0 16535.0
2012 2013 2014
38281 15014.0 16909.0 0.0
Upvotes: 2
Reputation: 4070
Here you go, it shall help you to get forward:
import pandas as pd
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/clean_gdp_data_all.csv'
c=pd.read_csv(url, index_col=0)
d = c.loc[(c.GeoName == 'California') & (c.ComponentName == 'Real GDP by state')]
for y1, y2 in zip(c.columns[4:], c.columns[5:]):
d1 = d.sort_values(y1, ascending=False).head(10)[['GeoName','IndustryClassification', 'Description', 'ComponentName', y1 ]]
e1 = d.sort_values(y2, ascending=False).head(10)[['GeoName','IndustryClassification', 'Description', 'ComponentName', y2 ]]
crgdp = pd.merge(d1,e1, how='inner', on=['GeoName','IndustryClassification', 'Description', 'ComponentName'])
crgdp.to_csv('{0}-{1}.csv'.format(y1,y2), index=False)
Upvotes: 2