Reputation: 101
I am currently doing a project with baby name data. I am looking at the most popular male and female baby names in each decade starting with the 1950s. I am trying to create a function that will print out the top name for the data set that I input.
So far I have successfully created two datasets for each decade (one male and the other female)
This is the code that I have for the function but I can't seem to figure out how to make it work...
def getTopName(data):
(data
.drop(columns =['sex', 'prop'])
.pivot(index = 'name', columns = 'year', values = 'n')
.sum(axis=1) = data['decade']
.sort_values(by = 'decade', ascending = False))
print data[0:1]
Any suggestions on how to accomplish this?
Its currently in longform. Can i create a middle function that converts it to wide form and builds a new column where the totals from each year (1960, 1961, ... 1969) can be added together?
Upvotes: 3
Views: 1843
Reputation: 603
Question 1 - Name with highest n per year
df.groupby(by='name', as_index=False)
.count()
.nlargest(1, 'number')
.iloc[0]["name"]
Question 2 - Transform data to wideform
Sample data on which this was tested
Pivot in pandas does not do aggregations. So I split up the steps in getting totals per year and totals per decade. Finally, I join those two to get the desired result:
import pandas as pd
df = pd.read_csv('set2.csv')
# add decade column
df["decade"] = df["year"] - (df["year"] % 10)
# add decade_title column to prevent join clashes
df["decade_total"] = df["decade"]
.apply(lambda decade_num: f"{str(decade_num)}_total")
# first pivot with n per year
per_year_df = df.pivot(index="name", columns="year", values="n")
# pivot cannot aggregate so we first aggregate and then pivot
per_decade_df = df\
.groupby(by=["decade_total", "name"], as_index=False)\
.agg({"n": 'sum'})\
.pivot(index="name", columns="decade_total", values="n")
# finally we join the decade totals to the yearly counts
joined_df = per_year_df.join(per_decade_df)
Upvotes: 1