P.Weyh
P.Weyh

Reputation: 97

Count multiple occurences exactly once

I have a dataframe that kinda looks like this:

Year Name      Region
2018 Smith     US
2018 James     US
2018 John      Other
2017 John      Other
2017 Peter     US
2017 Jeanette  US
2017 Lauren    US
2017 James     US
2016 Jeanette  US
2016 James     EU
2015 Peter     EU
...
2000 Peter     EU
2000 Mary      US
2000 Jeanette  EU
1999 James     EU
1998 John      EU

Now what I want to do is track the region for these names for the years <= 2011 and > 2011. The problem here is that the names appear multiple times throughout the years but need to be registered exactly once up to and including 2011 and after 2011. What's more, I need to register every individual that has a "EU" tag before 2012 and changes this tag after 2011 (like e.g. Jeanette here) and need to count those individuals exactly once. To make matters more difficult, a person might keep their "EU" tag up until 2018 (like James here) but change it in 2017 going forward, but I need to register this as "change" of region anyway, even if their tags are still "EU" 2012-2017.

This is all to answer the question: how many scientists have left the EU after 2011 and how many have stayed?

On another note, I also want to do this with researchers that come into the EU after 2011 by identifying researchers that have a US or Other region tag pre-2012 and change it after 2011.

What I tried already to count occurences is:

sum(df$Region[df$Name == "James"])

but that of course counts all James's for me and it is also not viable since there are approximately 9600 entries and I can't check all of those by hand to create functions for each and every unique individual.

My ideal output would kinda look like (numbers are made up):

Number of researchers that left the EU after 2011: 76
Number of reserachers that came to the EU after 2011: 24
Total number of researchers in the EU pre-2012: 264
Total number of researchers in the EU post-2012: 115

If possible, I would even like to do that per year. Then it would look kinda like this:

Number of researchers that left the EU in 2018: 76
Number of reserachers that came to the EU 2018: 24
Total number of researchers in the EU in 2018: 264

Number of researchers that left the EU in 2017: 23
Number of reserachers that came to the EU 2017: 11
Total number of researchers in the EU in 2017: 105

...

Number of researchers that left the EU in 2012: 20
Number of reserachers that came to the EU 2012: 5
Total number of researchers in the EU in 2012: 160

I apologize for my lack of sophisticated code for things I have tried already but I am pretty new to R and kinda out of ideas. Any help is appreciated, thank you very much!

Upvotes: 0

Views: 42

Answers (1)

GKi
GKi

Reputation: 39727

You can first order by Name and Year and then set if the person has came or left a region. With this you can show with table how many are there, came or have left per Year and Region:

df <- df[order(df$Name, df$Year),]
df$came  <- as.logical(with(df, ave(as.character(Region), Name, FUN=function(x) {c(FALSE, x[-length(x)] != x[-1])})))
df$left  <- c(df$came[-1], FALSE)

with(df, table(Year, Region)) #Total number of researchers per Year and Region
with(df[df$came,], table(Year, Region)) #Number of researchers that came per Year and Region
with(df[df$left,], table(Year, Region)) #Number of researchers that left per Year and Region

Upvotes: 1

Related Questions