Reputation: 2231
I have a list of countries by year, like so
country year founding_year other_vars
Fake 1900 1950 data
Fake 1901 1950 data
Fake 1902 1950 data
(...)
USE 1900 1901 data
USE 1901 1901 data
USE 1901 1901 data
The founding_year
is a bit confusing at first, but what the dataset is doing, is that its tracking the countries that are countries in 2001
, per year - collecting various statistics on them.
No I want to create a graph to show the creation / founding of nations over time. I already have the year
variable for the X axis, and the founding_year
variable supplies the information I need - but I am having trouble with the groupby
operation to get the number of new nations per founding year
I use the following command:
df.groupby(['founding_years', 'country']).size()
I chose both the founding_year
and country
variables to make sure that I have unique pairs (as there are multiple rows per nation)
However, this give me an erroneous result.
founding_year country
1945 Austria 46
Poland 46
1946 Jordan 46
Lebanon 46
Philippines 46
Syria 16
1947 India 46
Pakistan 25
1948 Israel 46
Myanmar 46
North Korea 46
South Korea 46
Sri Lanka 46
It instead returns the number of lines that this nation has in the database. A .count()
command gives the same result.
I tried adding year
at the end of the groupby function, but that makes it just return a whole bunch of unique values.
I am a bit puzzled, all the information I need is there, but I seem to be unable to figure out how to get it - do any of you know what I am missing?
Upvotes: 3
Views: 7237
Reputation: 164773
For the number of nations per founding year, you should group only by year. For example:
df = pd.DataFrame([['c1', 1950], ['c1', 1950], ['c1', 1950],
['c2', 1960], ['c2', 1960], ['c2', 1960],
['c3', 1970], ['c3', 1970], ['c3', 1970],
['c4', 1960], ['c4', 1960], ['c4', 1960],
['c5', 1950], ['c5', 1950], ['c5', 1950]],
columns=['country', 'year'])
res = df.groupby('year')['country'].nunique().reset_index()
print(res)
year country
0 1950 2
1 1960 2
2 1970 1
Or, if you need add a count series to your original dataframe, use pd.DataFrame.transform
:
df['count'] = df.groupby('year')['country'].transform('nunique')
print(df)
country year count
0 c1 1950 2
1 c1 1950 2
2 c1 1950 2
...
6 c3 1970 1
7 c3 1970 1
8 c3 1970 1
9 c4 1960 2
...
13 c5 1950 2
14 c5 1950 2
Upvotes: 7