Reputation: 1
I am stuck on a simple task. I want to create an empty DataFrame and append rows to it based on a query of another dataset. I have tried the answers here but I am missing something (I'm a Python beginner). Any help would be appreciated. I want to take the top 3 rows of each state and add them to a new DataFrame for processing. I also tried append.
def test():
    #get the list of states
    states_df = census_df.STNAME.unique()
    population_df = pd.DataFrame()
    for st in states_df:
        temp_df = pd.DataFrame(census_df[census_df['STNAME'] == st].nlargest(3,'CENSUS2010POP'))
        pd.concat([temp_df, population_df], ignore_index = True)
    return 1
Upvotes: 0
Views: 87
Reputation: 634
I think I know what course you're doing. I had a great time with it a year ago, keep it up!
The simplest and fastest way I've found to concatenate a bunch of sliced dataframes is to append each df to a list, then concatenate that list once at the end. See the working code below (it does what I think you meant).
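I agree with David's suggestion on sorting: it is easy to use sort_values and then slice the first 3 rows. To be fair, DataFrame.nlargest(3, 'CENSUS2010POP') also returns a DataFrame with all the columns (only Series.nlargest returns a Series), so your nlargest call would work for concatenation too. The actual problem in your code is that the result of pd.concat is never assigned to anything.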
Also why is your function returning 1? Typo? I guess you want to return your desired output if you're putting it in a function, so I changed that too.
import pandas as pd
import numpy as np

#create fake data random numbers
data = np.random.randint(2,11,(40,3))
census_df = pd.DataFrame(index=range(40), columns=['Blah', 'Blah2','CENSUS2010POP'], data=data)
#create fake STNAME column
census_df['STNAME'] = list('aaaabbbbccccddddeeeeffffgggghhhhiiiijjjj')

#Function:
def test(census_df):
    states_list = census_df.STNAME.unique() #changed naming to _list as it's not a df.
    list_of_dfs = list() #more efficient to append each df to a list
    for st in states_list:
        temp_df = census_df[census_df['STNAME']==st]
        temp_df = temp_df.sort_values(by=['CENSUS2010POP'], ascending=False).iloc[:3]
        list_of_dfs.append(temp_df)
    population_df = pd.concat(list_of_dfs,ignore_index=True)
    return population_df

population_df = test(census_df)
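As a possible alternative to the explicit loop above, the same "top 3 per state" selection can probably be written with a single sort followed by groupby(...).head(3). This is only a sketch: the fake data and the function name top3_per_state are made up for illustration, and only the STNAME and CENSUS2010POP column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real census data: 10 states, 4 rows each.
rng = np.random.default_rng(0)
census_df = pd.DataFrame({
    "STNAME": list("aaaabbbbccccddddeeeeffffgggghhhhiiiijjjj"),
    "CENSUS2010POP": rng.integers(2, 11, size=40),
})

def top3_per_state(df):
    # Sort once by population (largest first), then keep the first
    # 3 rows of every state group. All columns are preserved.
    ordered = df.sort_values("CENSUS2010POP", ascending=False)
    return ordered.groupby("STNAME").head(3).reset_index(drop=True)

population_df = top3_per_state(census_df)
```

This avoids building intermediate frames by hand, at the cost of the rows coming back ordered by population rather than grouped by state; add a further sort on STNAME if the grouping order matters.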
Upvotes: 1
Reputation: 220
Welcome to SO! Is your problem appending or the top three rows?
For appending, the important part is assigning the result back to population_df. Older pandas versions had a df.append function for this, but it was deprecated in 1.4 and removed in 2.0, so use pd.concat instead. It could look something like:
#get the list of states
states_df = census_df.STNAME.unique()
population_df = pd.DataFrame()
for st in states_df:
    temp_df = census_df[census_df['STNAME'] == st].nlargest(3, 'CENSUS2010POP')
    population_df = pd.concat([population_df, temp_df], ignore_index=True) #add the temp df to your main df, ignoring the index
For the top rows you could use df.sort_values(by=['column name'], ascending=False) and then select the top three rows inside the same loop:
    temp_df = census_df[census_df['STNAME'] == st].sort_values(by=['CENSUS2010POP'], ascending=False)
    population_df = pd.concat([population_df, temp_df[0:3]], ignore_index=True)
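To illustrate the appending point: pd.concat returns a new DataFrame and leaves its inputs untouched, which is why a bare pd.concat(...) line like the one in the question has no visible effect. A minimal sketch with two made-up frames, a and b:

```python
import pandas as pd

# Two tiny throwaway frames, purely for illustration.
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3]})

# The result is discarded here, so `a` still has its original 2 rows.
pd.concat([a, b], ignore_index=True)
rows_after_discard = len(a)

# Assigning the result back is what actually grows the frame.
a = pd.concat([a, b], ignore_index=True)
rows_after_assign = len(a)
```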
Upvotes: 0