Reputation: 809
I have the following link above for a CSV file containing the raw data for which I wish to manipulate.
census_df = df = pd.read_csv('https://raw.githubusercontent.com/Qian-Han/coursera-Applied-Data-Science-with-Python/master/Introduction-to-Data-Science-in-Python/original_data/census.csv')
sortedit = census_df.sort_values(by = ['STNAME','CENSUS2010POP'],ascending=False)
I am trying to order the data in descending order by the column 'CENSUS2010POP'.
I also want to order the data by 'state' alphabetically, hence why I have including the 'STNAME' column in the formula above.
However, I only want to select the 3 highest values for 'CENSUS2010POP' from each state ('STNAME').
Thus, if there are 146 states in total, I should (146 x 3) rows in my new dataframe (and thus in the 'CENSUS2010POP' column).
I would be so grateful if anybody could give me a helping hand?
Upvotes: 0
Views: 92
Reputation: 23099
IIUC, groupby
with .nalrgest
to create an index filter, chained with sort_values
df2 = df.iloc[df.groupby('STNAME')['CENSUS2010POP']\
.nlargest(3).index.get_level_values(1)]\
.sort_values(['STNAME','CENSUS2010POP'],ascending=True)
print(df['STNAME'].nunique())
51
print(df2.shape)
(152, 100)
print(df2[['STNAME','CENSUS2010POP']])
STNAME CENSUS2010POP
49 Alabama 412992
37 Alabama 658466
0 Alabama 4779736
76 Alaska 97581
71 Alaska 291826
... ... ...
3137 Wisconsin 947735
3096 Wisconsin 5686986
3182 Wyoming 75450
3180 Wyoming 91738
3169 Wyoming 563626
[152 rows x 2 columns]
Upvotes: 1
Reputation: 612
try this:
df = census_df.groupby(["STNAME"]).apply(lambda x: x.sort_values(["CENSUS2010POP"], ascending = False)).reset_index(drop=True)
df.groupby('STNAME').head(3)[['STNAME','CENSUS2010POP']]
The first statement returns dataframe sorted by CENSUS2010POP
in each STNAME
.
The second statement returns the top 3.
Upvotes: 0