eshaan kirpal
eshaan kirpal

Reputation: 41

how to find the three largest values in a data frame?

Data Set in Question:

enter image description here

For the data set show in the above image, I am trying to find the three most populous states while only taking into consideration the three most populous counties for each state. I use CENSUS2010POP.

This function should return a list of string values(in order of highest population to lowest population).

Below is My Code:

x=census_df.groupby('STNAME')['CENSUS2010POP'].nlargest(3)

This statement returns a series in which it groups the three most populous counties of each state.

Now, what do I do beyond this to further find the most populous states?

Do I use loops or is there another efficient method to approach this?

Upvotes: 3

Views: 9436

Answers (5)

nOObda
nOObda

Reputation: 123

temp provides data w.r.t State and County (maintaining the SUMLEV criteria as well)

tempo provides top 3 counties for each state

temp = census_df[census_df.SUMLEV==50].groupby(['STNAME','CTYNAME']).agg({"CENSUS2010POP":sum})
tempo = pd.DataFrame(temp['CENSUS2010POP'].groupby(level=0, group_keys=False).nlargest(3))
tempo.CENSUS2010POP.groupby(level=0, group_keys=False).sum().nlargest(3).index.tolist()

Output

STNAME
California    37253956
Texas         25145561
New York      19378102
Name: CENSUS2010POP, dtype: int64

Upvotes: 0

Nayan_py
Nayan_py

Reputation: 21

Below is my take on this. I know generating a dataframe using iteraion is not a recommended approach, so if there are better suggestions please do share.

fulldf3 = pd.DataFrame()
for i in census_df['STNAME'].unique():
    top3=census_df[(census_df['STNAME']==i) & (census_df['SUMLEV']==50)].groupby(['STNAME','CTYNAME'])['CENSUS2010POP'].sum().nlargest(3)
    dftop3=pd.DataFrame(top3)
    fulldf3=fulldf3.append(dftop3)

#gives the top3 states with population sum from top3 populous counties
fulldf3['CENSUS2010POP'].sum(level=0).nlargest(3)

#gives the names of the states in a list of strings
fulldf3['CENSUS2010POP'].sum(level=0).nlargest(3).index.to_list()

Output:

  • California 15924150
  • Texas 8269632
  • Illinois 6815061

Upvotes: 2

Jay Mulani
Jay Mulani

Reputation: 1

def answer_six():
    lst=[];
    df=census_df.groupby(['STNAME'])
    df=df.sum()
    for i in range(3):
        fd=df['CENSUS2010POP'].max()
        df_temp=df[df['CENSUS2010POP']==fd]
        lst.append(df_temp.index[0])

        df=df.drop(df_temp.index[0],axis=0)
    return lst
answer_six()

Upvotes: 0

yogs
yogs

Reputation: 53

def answer_six():
    county = census_df[census_df['SUMLEV']==50]
    states = county.groupby('STNAME')['CENSUS2010POP'].apply(lambda x: x.nlargest(3).sum()).nlargest(3).index.values.tolist()

    return states
answer_six()

Upvotes: 2

pawelty
pawelty

Reputation: 1000

Try adding sum() to the function:

df = pd.DataFrame({'STNAME': ['Alabama', 'Alabama', 'Alabama', 'Alabama', 'Alaska', 'Alaska', 'Alaska', 'Alaska','New York', 'New York', 'New York', 'New York', 'California'], 
                   'CTYNAME': ['a', 'b', 'c', 'd', 'z', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'], 
                   'CENSUS2010POP': [120000, 230000, 89000, 43000, 23000, 34000, 120000, 10000, 1200000, 1300000, 9800000, 560000, 45000]})

x=df.groupby('STNAME')['CENSUS2010POP'].sum().nlargest(3)

Output:

STNAME
New York    12860000
Alabama       482000
Alaska        187000
Name: CENSUS2010POP, dtype: int64

Upvotes: 5

Related Questions