Jasper
Jasper

Reputation: 2231

Get group size in pandas dataframe

I have a list of countries by year, like so

country  year  founding_year  other_vars
 Fake    1900    1950             data
 Fake    1901    1950             data
 Fake    1902    1950             data
             (...)
 USE     1900    1901             data
 USE     1901    1901             data
 USE     1901    1901             data

The founding_year is a bit confusing at first, but what the dataset is doing, is that its tracking the countries that are countries in 2001, per year - collecting various statistics on them.

No I want to create a graph to show the creation / founding of nations over time. I already have the year variable for the X axis, and the founding_year variable supplies the information I need - but I am having trouble with the groupby operation to get the number of new nations per founding year

I use the following command:

df.groupby(['founding_years', 'country']).size()

I chose both the founding_year and country variables to make sure that I have unique pairs (as there are multiple rows per nation)

However, this give me an erroneous result.

founding_year  country                 
1945     Austria                     46
         Poland                      46
1946     Jordan                      46
         Lebanon                     46
         Philippines                 46
         Syria                       16
1947     India                       46
         Pakistan                    25
1948     Israel                      46
         Myanmar                     46
         North Korea                 46
         South Korea                 46
         Sri Lanka                   46

It instead returns the number of lines that this nation has in the database. A .count() command gives the same result. I tried adding year at the end of the groupby function, but that makes it just return a whole bunch of unique values.

I am a bit puzzled, all the information I need is there, but I seem to be unable to figure out how to get it - do any of you know what I am missing?

Upvotes: 3

Views: 7237

Answers (1)

jpp
jpp

Reputation: 164773

For the number of nations per founding year, you should group only by year. For example:

df = pd.DataFrame([['c1', 1950], ['c1', 1950], ['c1', 1950],
                   ['c2', 1960], ['c2', 1960], ['c2', 1960],
                   ['c3', 1970], ['c3', 1970], ['c3', 1970],
                   ['c4', 1960], ['c4', 1960], ['c4', 1960],
                   ['c5', 1950], ['c5', 1950], ['c5', 1950]],
                  columns=['country', 'year'])

res = df.groupby('year')['country'].nunique().reset_index()

print(res)

   year  country
0  1950        2
1  1960        2
2  1970        1

Or, if you need add a count series to your original dataframe, use pd.DataFrame.transform:

df['count'] = df.groupby('year')['country'].transform('nunique')

print(df)

   country  year  count
0       c1  1950      2
1       c1  1950      2
2       c1  1950      2
...
6       c3  1970      1
7       c3  1970      1
8       c3  1970      1
9       c4  1960      2
...
13      c5  1950      2
14      c5  1950      2

Upvotes: 7

Related Questions