Reputation: 13
Maybe it is a really simple question but I am new in R and can't find the answer. Hopefully you will know :)
I am exploring the data base of Freedom in the World which contains data from 2013 till 2021.
While filtering, I found that some years there was 209 countries and others 210 countries. But I can't figure out which code/function I could use to find out which are being added/removed (maybe it's every time the same country, maybe it is not, I don't know).
The data.frame contains the variables of the year (Edition
) and names of the countries (Country/Territory
). Here you have a glimpse of the data frame:
> glimpse(head (Freedom_df))
Rows: 6
Columns: 44
$ `Country/Territory` <chr> "Abkhazia", "Afghanistan", "Albania", "Algeria", "Andorra", "Angola"
$ Region <chr> "Eurasia", "Asia", "Europe", "MENA", "Europe", "SSA"
$ `C/T` <chr> "t", "c", "c", "c", "c", "c"
$ Edition <dbl> 2021, 2021, 2021, 2021, 2021, 2021
$ Status <chr> "PF", "NF", "PF", "NF", "F", "NF"
$ `PR rating` <dbl> 5, 5, 3, 6, 1, 6
$ `CL rating` <dbl> 5, 6, 3, 5, 1, 5
$ A1 <dbl> 2, 1, 3, 1, 4, 0
$ A2 <dbl> 2, 1, 3, 1, 4, 2
$ A3 <dbl> 1, 1, 2, 1, 4, 1
$ A <dbl> 5, 3, 8, 3, 12, 3
$ B1 <dbl> 2, 2, 3, 1, 4, 1
$ B2 <dbl> 3, 2, 3, 1, 4, 1
$ B3 <dbl> 2, 1, 3, 1, 4, 1
$ B4 <dbl> 1, 2, 3, 1, 3, 2
$ B <dbl> 8, 7, 12, 4, 15, 5
$ C1 <dbl> 1, 1, 3, 1, 3, 1
$ C2 <dbl> 1, 1, 2, 1, 4, 1
$ C3 <dbl> 2, 1, 2, 1, 4, 0
$ C <dbl> 4, 3, 7, 3, 11, 2
$ `Add Q` <dbl> 0, 0, 0, 0, 0, 0
$ `Add A` <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A"
$ PR <dbl> 17, 13, 27, 10, 38, 10
$ D1 <dbl> 2, 2, 2, 1, 3, 1
$ D2 <dbl> 2, 1, 4, 1, 3, 2
$ D3 <dbl> 1, 1, 3, 2, 4, 2
$ D4 <dbl> 3, 2, 4, 2, 4, 2
$ D <dbl> 8, 6, 13, 6, 14, 7
$ E1 <dbl> 3, 2, 3, 1, 4, 2
$ E2 <dbl> 2, 1, 3, 1, 4, 2
$ E3 <dbl> 1, 1, 2, 1, 3, 2
$ E <dbl> 6, 4, 8, 3, 11, 6
$ F1 <dbl> 1, 1, 2, 1, 4, 1
$ F2 <dbl> 1, 0, 2, 1, 4, 1
$ F3 <dbl> 1, 0, 2, 2, 4, 1
$ F4 <dbl> 1, 1, 3, 2, 3, 2
$ F <dbl> 4, 2, 9, 6, 15, 5
$ G1 <dbl> 1, 0, 3, 2, 4, 1
$ G2 <dbl> 1, 1, 2, 2, 4, 1
$ G3 <dbl> 2, 0, 2, 2, 3, 1
$ G4 <dbl> 1, 1, 2, 1, 4, 0
$ G <dbl> 5, 2, 9, 7, 15, 3
$ CL <dbl> 23, 14, 39, 22, 55, 21
$ Total <dbl> 40, 27, 66, 32, 93, 31
Here you can see what I mentioned about having 209 or 210 countries depending on the year:
> count(Freedom_df, Edition)
# A tibble: 9 x 2
Edition n
* <dbl> <int>
1 2013 209
2 2014 209
3 2015 210
4 2016 210
5 2017 209
6 2018 209
7 2019 209
8 2020 210
9 2021 210
Here are two reproducible examples with the expected outputs.
Example 1: In this case I supose that there are 209 countries that always remain the same and there is just one that is added and removed.
# A tibble: 9 x 4
Edition n Added_country Removed_country
* <dbl> <int> <chr> <chr>
1 2013 209 NA NA
2 2014 209 NA NA
3 2015 210 "country_name" NA
4 2016 210 NA NA
5 2017 209 NA "country_name"
6 2018 209 NA NA
7 2019 209 NA NA
8 2020 210 "country_name" NA
9 2021 210 NA NA
Example 2: In this case I supose there are 207 countries that remain the same all the years (2013:2021) and 3 other countries added/removed while maintaining the same counts.
# A tibble: 9 x 4
Edition n Different_country
* <dbl> <int> <chr>
1 2013 209 "country_A"
2 2013 209 "country_B"
3 2014 209 "country_A"
4 2014 209 "country_C"
5 2015 210 "country_A"
6 2015 210 "country_B"
7 2015 210 "country_C"
8 2016 210 "country_A"
9 2016 210 "country_B"
10 2016 210 "country_C"
11 2017 209 "country_B"
12 2017 209 "country_C"
13 2018 209 "country_B"
14 2018 209 "country_C"
15 2019 209 "country_B"
16 2019 209 "country_C"
17 2020 210 "country_A"
18 2020 210 "country_B"
19 2020 210 "country_C"
20 2021 210 "country_A"
21 2021 210 "country_B"
22 2021 210 "country_C"
I think that's enough information to solve it, let me know if you need any other details. Thanks :)
Upvotes: 0
Views: 126
Reputation: 26238
EDIT: since original data has been traced by @awaji98 in his/her answer, it can be seen that the following strategy works
freedom <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQwUAEPTrb4AmNNYSdqCupsrXJcDOODfxTEVSZyK-yIAA2ozTGJmWLOnJHa3v-czcBitkfCx2AU_iqj/pub?gid=0&single=true&output=csv")
freedom %>% group_by(Edition) %>%
summarise(countries = list(Country)) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries, default = list(countries[[1]])), setdiff)) %>%
group_by(Edition) %>%
mutate(added = toString(unlist(added)),
removed = toString(unlist(removed)),
countries = length(unlist(countries))) %>%
ungroup()
# A tibble: 9 x 4
Edition countries removed added
<dbl> <int> <chr> <chr>
1 2013 209 "" ""
2 2014 209 "" ""
3 2015 210 "" "Crimea"
4 2016 210 "" ""
5 2017 209 "Puerto Rico" ""
6 2018 209 "" ""
7 2019 209 "" ""
8 2020 210 "" "Eastern Donbas"
9 2021 210 "" ""
Old answer
Let's first construct a data say freedom, since you have not added any
library(tidyverse)
set.seed(2021)
freedom <- data.frame(year = rep(2010:2014, each = 5),
country = c(sample(LETTERS[1:8], 5),
sample(LETTERS[1:8], 5),
sample(LETTERS[1:8], 5),
sample(LETTERS[1:8], 5),
sample(LETTERS[1:8], 5)),
val = round(100 * runif(25)))
freedom %>% pivot_wider(id_cols = country, names_from = year, names_sort = T, values_from = val) %>%
arrange(country)
#Let's have a look on this data
# A tibble: 8 x 6
country `2010` `2011` `2012` `2013` `2014`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 40 NA 76 52 63
2 B 82 NA 73 NA 20
3 C NA 90 NA NA 43
4 D NA 52 NA NA 54
5 E NA 93 29 36 NA
6 F 21 NA NA 23 NA
7 G 49 23 31 1 NA
8 H 68 62 70 88 17
Now the below syntax will give a list of year-wise added and removed countries. All countries will be treated as added during first year
freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries), setdiff)) %>%
select(-countries) %>%
unnest(c(added, removed))
# A tibble: 14 x 3
year removed added
<int> <chr> <chr>
1 2010 NA B
2 2010 NA G
3 2010 NA A
4 2010 NA F
5 2010 NA H
6 2011 B E
7 2011 A D
8 2011 F C
9 2012 D B
10 2012 C A
11 2013 B F
12 2014 G C
13 2014 F D
14 2014 E B
Or if don't want to see additions during first year, do this
freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
#group_by(year) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries), setdiff)) %>%
select(-countries) %>%
filter(as.numeric(row_number()) != 1) %>%
unnest(c(added, removed))
# A tibble: 9 x 3
year removed added
<int> <chr> <chr>
1 2011 B E
2 2011 A D
3 2011 F C
4 2012 D B
5 2012 C A
6 2013 B F
7 2014 G C
8 2014 F D
9 2014 E B
Or if you just want to see how many added and removed , do this
freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
#group_by(year) %>%
mutate(removed = unlist(map2(lag(countries), countries, function(x, y) length(setdiff(x, y)))),
added = unlist(map2(countries, lag(countries), function(x, y) length(setdiff(x, y))))) %>%
select(-countries)
# A tibble: 5 x 3
year removed added
<int> <int> <int>
1 2010 0 5
2 2011 3 3
3 2012 2 2
4 2013 1 1
5 2014 3 3
further update/edit As you have stated the countries count each year is not same, you may also adopt the following strategy
#let's edit the df to have unequal country count each year
freedom <- freedom[-c(9,21),]
#now
freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries), setdiff)) %>%
group_by(year) %>%
mutate(added = toString(unlist(added)),
removed = toString(unlist(removed)),
countries = length(unlist(countries))) %>%
ungroup()
# A tibble: 5 x 4
year countries removed added
<int> <int> <chr> <chr>
1 2010 5 "" B, G, A, F, H
2 2011 4 "B, A, F" E, D
3 2012 5 "D" B, A
4 2013 5 "B" F
5 2014 4 "G, F, E" D, B
I think now you can safely have an outlook on your data for additions/removals each year.
Upvotes: 0
Reputation: 685
Final Edit using the real data
Sourced real data here:
Freedom_df <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQwUAEPTrb4AmNNYSdqCupsrXJcDOODfxTEVSZyK-yIAA2ozTGJmWLOnJHa3v-czcBitkfCx2AU_iqj/pub?gid=0&single=true&output=csv")
Applying the code from second edit:
Freedom_df %>% mutate(Country = factor(Country)) %>%
count(Edition, Country, .drop = FALSE) %>% filter(n == 0)
Result:
Edition Country n
<dbl> <fct> <int>
1 2013 Crimea 0
2 2013 Eastern Donbas 0
3 2014 Crimea 0
4 2014 Eastern Donbas 0
5 2015 Eastern Donbas 0
6 2016 Eastern Donbas 0
7 2017 Eastern Donbas 0
8 2017 Puerto Rico 0
9 2018 Eastern Donbas 0
10 2018 Puerto Rico 0
11 2019 Eastern Donbas 0
12 2019 Puerto Rico 0
13 2020 Puerto Rico 0
14 2021 Puerto Rico 0
Just to clarify which years these three countries are present, we can create a vector of them and call it missing:
missing <- Freedom_df %>% mutate(Country = factor(Country)) %>%
count(Edition, Country, .drop = FALSE) %>% filter(n == 0 )%>%
pull %>% as.character
Then see which years the missing countries are present in the date
Freedom_df %>% filter(Country %in% missing) %>%
select(Country,Edition) %>% arrange(Country,Edition)
Result:
# A tibble: 13 x 2
Country Edition
<chr> <dbl>
1 Crimea 2015
2 Crimea 2016
3 Crimea 2017
4 Crimea 2018
5 Crimea 2019
6 Crimea 2020
7 Crimea 2021
8 Eastern Donbas 2020
9 Eastern Donbas 2021
10 Puerto Rico 2013
11 Puerto Rico 2014
12 Puerto Rico 2015
13 Puerto Rico 2016
Second Edit based on OP's example
I'm not sure what your other column names are, but assuming you have Edition and Country for the year and Country/Territory respectively, you could try using count with Edition set as a factor variable.
Freedom_df %>%
# change country column to factor
mutate(Country = factor(`Country/Territory`)) %>%
# count with .drop = FALSE to add 0's for any missing countries
count(Edition, Country, .drop = FALSE) %>%
# filter to find them
filter(n == 0)
Example:
# Example dataframe
df <- structure(list(Edition = c(2021, 2021, 2021, 2021, 2021, 2021,
2020, 2020, 2020, 2020, 2020, 2019, 2019, 2019, 2019, 2019, 2018,
2018, 2018, 2018, 2018, 2018), Country = c("Barbados", "Belarus",
"Belgium", "Belize", "Benin", "Bhutan", "Belarus", "Barbados",
"Belize", "Bhutan", "Benin", "Belgium", "Barbados", "Belize",
"Belarus", "Bhutan", "Barbados", "Belarus", "Belgium", "Belize",
"Benin", "Bhutan")), row.names = c(NA, -22L), class = c("tbl_df",
"tbl", "data.frame"))
# the code
df %>% mutate(Country = factor(Country)) %>%
count(Edition, Country, .drop = FALSE) %>%
filter(n == 0)
Results in:
Edition Country n
<dbl> <fct> <int>
1 2019 Benin 0
2 2020 Belgium 0
First edit If all the data are in one dataframe, you can split the dataframe by year first, and then use the anti_join on the list objects as above.
## split the dataframe by year
df <- df %>% group_split(year) %>% setNames(unique(df$year))
## do the anti_join with smaller subset last
anti_join(df$`2021`,df$`2020`, by = "country")
Here is an example of it in practice:
## An example df (reproduced by dput())
df <- structure(list(year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021),
country = c("Abkhazia*", "Afghanistan", "Albania", "Algeria",
"Andorra", "Angola", "Antigua and Barbuda", "Argentina",
"Armenia", "Australia", "Austria", "Azerbaijan", "Bangladesh",
"Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan",
"Abkhazia*", "Afghanistan", "Albania", "Algeria", "Andorra",
"Angola", "Antigua and Barbuda", "Argentina", "Armenia",
"Australia", "Austria", "Azerbaijan", "Bahrain", "Bangladesh",
"Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan"
), score = c("40", "27", "66", "32", "93", "31", "85", "84",
"55", "97", "93", "10", "39", "95", "11", "96", "87", "65",
"61", "40", "27", "66", "32", "93", "31", "85", "84", "55",
"97", "93", "10", "12", "39", "95", "11", "96", "87", "65",
"61")), row.names = c(NA, -39L), class = c("tbl_df", "tbl",
"data.frame"))
This is the df:
year country score
<dbl> <chr> <chr>
1 2020 Abkhazia* 40
2 2020 Afghanistan 27
3 2020 Albania 66
4 2020 Algeria 32
5 2020 Andorra 93
6 2020 Angola 31
7 2020 Antigua and Barbuda 85
8 2020 Argentina 84
9 2020 Armenia 55
10 2020 Australia 97
# … with 29 more rows
## split the dataframe by year
df <- df %>% group_split(year) %>% setNames(unique(df$year))
Results in:
$`2020`
# A tibble: 19 x 3
year country score
<dbl> <chr> <chr>
1 2020 Abkhazia* 40
2 2020 Afghanistan 27
3 2020 Albania 66
4 2020 Algeria 32
5 2020 Andorra 93
6 2020 Angola 31
7 2020 Antigua and Barbuda 85
8 2020 Argentina 84
9 2020 Armenia 55
10 2020 Australia 97
11 2020 Austria 93
12 2020 Azerbaijan 10
13 2020 Bangladesh 39
14 2020 Barbados 95
15 2020 Belarus 11
16 2020 Belgium 96
17 2020 Belize 87
18 2020 Benin 65
19 2020 Bhutan 61
$`2021`
# A tibble: 20 x 3
year country score
<dbl> <chr> <chr>
1 2021 Abkhazia* 40
2 2021 Afghanistan 27
3 2021 Albania 66
4 2021 Algeria 32
5 2021 Andorra 93
6 2021 Angola 31
7 2021 Antigua and Barbuda 85
8 2021 Argentina 84
9 2021 Armenia 55
10 2021 Australia 97
11 2021 Austria 93
12 2021 Azerbaijan 10
13 2021 Bahrain 12
14 2021 Bangladesh 39
15 2021 Barbados 95
16 2021 Belarus 11
17 2021 Belgium 96
18 2021 Belize 87
19 2021 Benin 65
20 2021 Bhutan 61
Then do the anti_join
## do the anti_join with smaller subset last
anti_join(df$`2021`,df$`2020`, by = "country")
Result:
# A tibble: 1 x 3
year country score
<dbl> <chr> <chr>
1 2021 Bahrain 12
Upvotes: 0