xescola
xescola

Reputation: 13

How do I find the country which is added/removed a specific year in a data base?

Maybe it is a really simple question but I am new in R and can't find the answer. Hopefully you will know :)

I am exploring the data base of Freedom in the World which contains data from 2013 till 2021.

While filtering, I found that some years there was 209 countries and others 210 countries. But I can't figure out which code/function I could use to find out which are being added/removed (maybe it's every time the same country, maybe it is not, I don't know).

The data.frame contains the variables of the year (Edition) and names of the countries (Country/Territory). Here you have a glimpse of the data frame:

    > glimpse(head (Freedom_df))
Rows: 6
Columns: 44
$ `Country/Territory` <chr> "Abkhazia", "Afghanistan", "Albania", "Algeria", "Andorra", "Angola"
$ Region              <chr> "Eurasia", "Asia", "Europe", "MENA", "Europe", "SSA"
$ `C/T`               <chr> "t", "c", "c", "c", "c", "c"
$ Edition             <dbl> 2021, 2021, 2021, 2021, 2021, 2021
$ Status              <chr> "PF", "NF", "PF", "NF", "F", "NF"
$ `PR rating`         <dbl> 5, 5, 3, 6, 1, 6
$ `CL rating`         <dbl> 5, 6, 3, 5, 1, 5
$ A1                  <dbl> 2, 1, 3, 1, 4, 0
$ A2                  <dbl> 2, 1, 3, 1, 4, 2
$ A3                  <dbl> 1, 1, 2, 1, 4, 1
$ A                   <dbl> 5, 3, 8, 3, 12, 3
$ B1                  <dbl> 2, 2, 3, 1, 4, 1
$ B2                  <dbl> 3, 2, 3, 1, 4, 1
$ B3                  <dbl> 2, 1, 3, 1, 4, 1
$ B4                  <dbl> 1, 2, 3, 1, 3, 2
$ B                   <dbl> 8, 7, 12, 4, 15, 5
$ C1                  <dbl> 1, 1, 3, 1, 3, 1
$ C2                  <dbl> 1, 1, 2, 1, 4, 1
$ C3                  <dbl> 2, 1, 2, 1, 4, 0
$ C                   <dbl> 4, 3, 7, 3, 11, 2
$ `Add Q`             <dbl> 0, 0, 0, 0, 0, 0
$ `Add A`             <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A"
$ PR                  <dbl> 17, 13, 27, 10, 38, 10
$ D1                  <dbl> 2, 2, 2, 1, 3, 1
$ D2                  <dbl> 2, 1, 4, 1, 3, 2
$ D3                  <dbl> 1, 1, 3, 2, 4, 2
$ D4                  <dbl> 3, 2, 4, 2, 4, 2
$ D                   <dbl> 8, 6, 13, 6, 14, 7
$ E1                  <dbl> 3, 2, 3, 1, 4, 2
$ E2                  <dbl> 2, 1, 3, 1, 4, 2
$ E3                  <dbl> 1, 1, 2, 1, 3, 2
$ E                   <dbl> 6, 4, 8, 3, 11, 6
$ F1                  <dbl> 1, 1, 2, 1, 4, 1
$ F2                  <dbl> 1, 0, 2, 1, 4, 1
$ F3                  <dbl> 1, 0, 2, 2, 4, 1
$ F4                  <dbl> 1, 1, 3, 2, 3, 2
$ F                   <dbl> 4, 2, 9, 6, 15, 5
$ G1                  <dbl> 1, 0, 3, 2, 4, 1
$ G2                  <dbl> 1, 1, 2, 2, 4, 1
$ G3                  <dbl> 2, 0, 2, 2, 3, 1
$ G4                  <dbl> 1, 1, 2, 1, 4, 0
$ G                   <dbl> 5, 2, 9, 7, 15, 3
$ CL                  <dbl> 23, 14, 39, 22, 55, 21
$ Total               <dbl> 40, 27, 66, 32, 93, 31

Here you can see what I mentioned about having 209 or 210 countries depending on the year:

> count(Freedom_df, Edition) 
# A tibble: 9 x 2
  Edition     n
*   <dbl> <int>
1    2013   209
2    2014   209
3    2015   210
4    2016   210
5    2017   209
6    2018   209
7    2019   209
8    2020   210
9    2021   210

Here are two reproducible examples with the expected outputs.

Example 1: In this case I supose that there are 209 countries that always remain the same and there is just one that is added and removed.

 # A tibble: 9 x 4
  Edition     n    Added_country   Removed_country 
*   <dbl> <int>            <chr>             <chr>
1    2013   209               NA                NA
2    2014   209               NA                NA
3    2015   210   "country_name"                NA
4    2016   210               NA                NA
5    2017   209               NA    "country_name"
6    2018   209               NA                NA
7    2019   209               NA                NA
8    2020   210   "country_name"                NA
9    2021   210               NA                NA

Example 2: In this case I supose there are 207 countries that remain the same all the years (2013:2021) and 3 other countries added/removed while maintaining the same counts.

 # A tibble: 9 x 4
  Edition     n    Different_country
 *  <dbl> <int>                <chr>
 1   2013   209          "country_A"
 2   2013   209          "country_B"
 3   2014   209          "country_A"
 4   2014   209          "country_C"
 5   2015   210          "country_A"
 6   2015   210          "country_B"
 7   2015   210          "country_C"
 8   2016   210          "country_A"
 9   2016   210          "country_B"
10   2016   210          "country_C"
11   2017   209          "country_B"
12   2017   209          "country_C"
13   2018   209          "country_B"
14   2018   209          "country_C"
15   2019   209          "country_B"
16   2019   209          "country_C"
17   2020   210          "country_A"
18   2020   210          "country_B"
19   2020   210          "country_C"
20   2021   210          "country_A"
21   2021   210          "country_B"
22   2021   210          "country_C"

I think that's enough information to solve it, let me know if you need any other details. Thanks :)

Upvotes: 0

Views: 126

Answers (2)

AnilGoyal
AnilGoyal

Reputation: 26238

EDIT: since original data has been traced by @awaji98 in his/her answer, it can be seen that the following strategy works

freedom <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQwUAEPTrb4AmNNYSdqCupsrXJcDOODfxTEVSZyK-yIAA2ozTGJmWLOnJHa3v-czcBitkfCx2AU_iqj/pub?gid=0&single=true&output=csv")


freedom %>% group_by(Edition) %>%
  summarise(countries = list(Country)) %>%
  mutate(removed = map2(lag(countries), countries, setdiff),
         added = map2(countries, lag(countries, default = list(countries[[1]])), setdiff)) %>%
  group_by(Edition) %>%
  mutate(added = toString(unlist(added)),
         removed = toString(unlist(removed)),
         countries = length(unlist(countries))) %>%
  ungroup()

# A tibble: 9 x 4
  Edition countries removed       added           
    <dbl>     <int> <chr>         <chr>           
1    2013       209 ""            ""              
2    2014       209 ""            ""              
3    2015       210 ""            "Crimea"        
4    2016       210 ""            ""              
5    2017       209 "Puerto Rico" ""              
6    2018       209 ""            ""              
7    2019       209 ""            ""              
8    2020       210 ""            "Eastern Donbas"
9    2021       210 ""            ""

Old answer

Let's first construct a data say freedom, since you have not added any

library(tidyverse)

set.seed(2021)
freedom <- data.frame(year = rep(2010:2014, each = 5),
                      country = c(sample(LETTERS[1:8], 5), 
                                  sample(LETTERS[1:8], 5), 
                                  sample(LETTERS[1:8], 5), 
                                  sample(LETTERS[1:8], 5), 
                                  sample(LETTERS[1:8], 5)),
                      val = round(100 * runif(25)))

freedom %>% pivot_wider(id_cols = country, names_from = year, names_sort = T, values_from = val) %>%
  arrange(country)
#Let's have a look on this data
# A tibble: 8 x 6
  country `2010` `2011` `2012` `2013` `2014`
  <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 A           40     NA     76     52     63
2 B           82     NA     73     NA     20
3 C           NA     90     NA     NA     43
4 D           NA     52     NA     NA     54
5 E           NA     93     29     36     NA
6 F           21     NA     NA     23     NA
7 G           49     23     31      1     NA
8 H           68     62     70     88     17

Now the below syntax will give a list of year-wise added and removed countries. All countries will be treated as added during first year

freedom %>% group_by(year) %>%
  summarise(countries = list(country)) %>%
  mutate(removed = map2(lag(countries), countries, setdiff),
         added = map2(countries, lag(countries), setdiff)) %>%
  select(-countries) %>%
  unnest(c(added, removed))

# A tibble: 14 x 3
    year removed added
   <int> <chr>   <chr>
 1  2010 NA      B    
 2  2010 NA      G    
 3  2010 NA      A    
 4  2010 NA      F    
 5  2010 NA      H    
 6  2011 B       E    
 7  2011 A       D    
 8  2011 F       C    
 9  2012 D       B    
10  2012 C       A    
11  2013 B       F    
12  2014 G       C    
13  2014 F       D    
14  2014 E       B

Or if don't want to see additions during first year, do this

freedom %>% group_by(year) %>%
  summarise(countries = list(country)) %>%
  #group_by(year) %>%
  mutate(removed = map2(lag(countries), countries, setdiff),
         added = map2(countries, lag(countries), setdiff)) %>%
  select(-countries) %>%
  filter(as.numeric(row_number()) != 1) %>%
  unnest(c(added, removed))

# A tibble: 9 x 3
   year removed added
  <int> <chr>   <chr>
1  2011 B       E    
2  2011 A       D    
3  2011 F       C    
4  2012 D       B    
5  2012 C       A    
6  2013 B       F    
7  2014 G       C    
8  2014 F       D    
9  2014 E       B

Or if you just want to see how many added and removed , do this

freedom %>% group_by(year) %>%
  summarise(countries = list(country)) %>%
  #group_by(year) %>%
  mutate(removed = unlist(map2(lag(countries), countries, function(x, y) length(setdiff(x, y)))),
         added = unlist(map2(countries, lag(countries), function(x, y) length(setdiff(x, y))))) %>%
  select(-countries)

# A tibble: 5 x 3
   year removed added
  <int>   <int> <int>
1  2010       0     5
2  2011       3     3
3  2012       2     2
4  2013       1     1
5  2014       3     3

further update/edit As you have stated the countries count each year is not same, you may also adopt the following strategy

#let's edit the df to have unequal country count each year
freedom <- freedom[-c(9,21),]

#now
freedom %>% group_by(year) %>%
  summarise(countries = list(country)) %>%
  mutate(removed = map2(lag(countries), countries, setdiff),
         added = map2(countries, lag(countries), setdiff)) %>%
  group_by(year) %>%
  mutate(added = toString(unlist(added)),
         removed = toString(unlist(removed)),
         countries = length(unlist(countries))) %>%
  ungroup()

# A tibble: 5 x 4
   year countries removed   added        
  <int>     <int> <chr>     <chr>        
1  2010         5 ""        B, G, A, F, H
2  2011         4 "B, A, F" E, D         
3  2012         5 "D"       B, A         
4  2013         5 "B"       F            
5  2014         4 "G, F, E" D, B

I think now you can safely have an outlook on your data for additions/removals each year.

Upvotes: 0

awaji98
awaji98

Reputation: 685

Final Edit using the real data

Sourced real data here:

Freedom_df <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQwUAEPTrb4AmNNYSdqCupsrXJcDOODfxTEVSZyK-yIAA2ozTGJmWLOnJHa3v-czcBitkfCx2AU_iqj/pub?gid=0&single=true&output=csv")

Applying the code from second edit:

Freedom_df %>% mutate(Country = factor(Country)) %>% 
count(Edition, Country, .drop = FALSE) %>% filter(n == 0)

Result:

  Edition Country            n
     <dbl> <fct>          <int>
 1    2013 Crimea             0
 2    2013 Eastern Donbas     0
 3    2014 Crimea             0
 4    2014 Eastern Donbas     0
 5    2015 Eastern Donbas     0
 6    2016 Eastern Donbas     0
 7    2017 Eastern Donbas     0
 8    2017 Puerto Rico        0
 9    2018 Eastern Donbas     0
10    2018 Puerto Rico        0
11    2019 Eastern Donbas     0
12    2019 Puerto Rico        0
13    2020 Puerto Rico        0
14    2021 Puerto Rico        0

Just to clarify which years these three countries are present, we can create a vector of them and call it missing:

missing <- Freedom_df %>% mutate(Country = factor(Country)) %>% 
count(Edition, Country, .drop = FALSE) %>% filter(n == 0 )%>% 
pull %>% as.character

Then see which years the missing countries are present in the date

Freedom_df %>% filter(Country %in% missing) %>% 
select(Country,Edition) %>% arrange(Country,Edition)

Result:

# A tibble: 13 x 2
   Country        Edition
   <chr>            <dbl>
 1 Crimea            2015
 2 Crimea            2016
 3 Crimea            2017
 4 Crimea            2018
 5 Crimea            2019
 6 Crimea            2020
 7 Crimea            2021
 8 Eastern Donbas    2020
 9 Eastern Donbas    2021
10 Puerto Rico       2013
11 Puerto Rico       2014
12 Puerto Rico       2015
13 Puerto Rico       2016

Second Edit based on OP's example

I'm not sure what your other column names are, but assuming you have Edition and Country for the year and Country/Territory respectively, you could try using count with Edition set as a factor variable.

Freedom_df %>% 
# change country column to factor
mutate(Country = factor(`Country/Territory`)) %>%
# count with .drop = FALSE to add 0's for any missing countries 
count(Edition, Country, .drop = FALSE) %>%
# filter to find them
filter(n == 0)

Example:

# Example dataframe
df <- structure(list(Edition = c(2021, 2021, 2021, 2021, 2021, 2021, 
2020, 2020, 2020, 2020, 2020, 2019, 2019, 2019, 2019, 2019, 2018, 
2018, 2018, 2018, 2018, 2018), Country = c("Barbados", "Belarus", 
"Belgium", "Belize", "Benin", "Bhutan", "Belarus", "Barbados", 
"Belize", "Bhutan", "Benin", "Belgium", "Barbados", "Belize", 
"Belarus", "Bhutan", "Barbados", "Belarus", "Belgium", "Belize", 
"Benin", "Bhutan")), row.names = c(NA, -22L), class = c("tbl_df", 
"tbl", "data.frame"))

# the code
 df %>% mutate(Country = factor(Country)) %>% 
    count(Edition, Country, .drop = FALSE) %>% 
    filter(n == 0) 

Results in:

  Edition Country     n
    <dbl> <fct>   <int>
1    2019 Benin       0
2    2020 Belgium     0

First edit If all the data are in one dataframe, you can split the dataframe by year first, and then use the anti_join on the list objects as above.

## split the dataframe by year
df <- df %>% group_split(year) %>% setNames(unique(df$year))
## do the anti_join with smaller subset last
anti_join(df$`2021`,df$`2020`, by = "country")

Here is an example of it in practice:

## An example df (reproduced by dput())
df <- structure(list(year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 
2020, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 
2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021), 
    country = c("Abkhazia*", "Afghanistan", "Albania", "Algeria", 
    "Andorra", "Angola", "Antigua and Barbuda", "Argentina", 
    "Armenia", "Australia", "Austria", "Azerbaijan", "Bangladesh", 
    "Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan", 
    "Abkhazia*", "Afghanistan", "Albania", "Algeria", "Andorra", 
    "Angola", "Antigua and Barbuda", "Argentina", "Armenia", 
    "Australia", "Austria", "Azerbaijan", "Bahrain", "Bangladesh", 
    "Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan"
    ), score = c("40", "27", "66", "32", "93", "31", "85", "84", 
    "55", "97", "93", "10", "39", "95", "11", "96", "87", "65", 
    "61", "40", "27", "66", "32", "93", "31", "85", "84", "55", 
    "97", "93", "10", "12", "39", "95", "11", "96", "87", "65", 
    "61")), row.names = c(NA, -39L), class = c("tbl_df", "tbl", 
"data.frame"))

This is the df:

 year country             score
   <dbl> <chr>               <chr>
 1  2020 Abkhazia*           40   
 2  2020 Afghanistan         27   
 3  2020 Albania             66   
 4  2020 Algeria             32   
 5  2020 Andorra             93   
 6  2020 Angola              31   
 7  2020 Antigua and Barbuda 85   
 8  2020 Argentina           84   
 9  2020 Armenia             55   
10  2020 Australia           97   
# … with 29 more rows

## split the dataframe by year
df <- df %>% group_split(year) %>% setNames(unique(df$year))

Results in:

   $`2020`
# A tibble: 19 x 3
    year country             score
   <dbl> <chr>               <chr>
 1  2020 Abkhazia*           40   
 2  2020 Afghanistan         27   
 3  2020 Albania             66   
 4  2020 Algeria             32   
 5  2020 Andorra             93   
 6  2020 Angola              31   
 7  2020 Antigua and Barbuda 85   
 8  2020 Argentina           84   
 9  2020 Armenia             55   
10  2020 Australia           97   
11  2020 Austria             93   
12  2020 Azerbaijan          10   
13  2020 Bangladesh          39   
14  2020 Barbados            95   
15  2020 Belarus             11   
16  2020 Belgium             96   
17  2020 Belize              87   
18  2020 Benin               65   
19  2020 Bhutan              61   

$`2021`
# A tibble: 20 x 3
    year country             score
   <dbl> <chr>               <chr>
 1  2021 Abkhazia*           40   
 2  2021 Afghanistan         27   
 3  2021 Albania             66   
 4  2021 Algeria             32   
 5  2021 Andorra             93   
 6  2021 Angola              31   
 7  2021 Antigua and Barbuda 85   
 8  2021 Argentina           84   
 9  2021 Armenia             55   
10  2021 Australia           97   
11  2021 Austria             93   
12  2021 Azerbaijan          10   
13  2021 Bahrain             12   
14  2021 Bangladesh          39   
15  2021 Barbados            95   
16  2021 Belarus             11   
17  2021 Belgium             96   
18  2021 Belize              87   
19  2021 Benin               65   
20  2021 Bhutan              61  

Then do the anti_join

## do the anti_join with smaller subset last
anti_join(df$`2021`,df$`2020`, by = "country")

Result:

# A tibble: 1 x 3
   year country score
  <dbl> <chr>   <chr>
1  2021 Bahrain 12  

Upvotes: 0

Related Questions