sneeze_shiny

Reputation: 328

Why are not all rows filtered when filtering a dataset using dplyr?

I have a very large dataset and am trying to filter out specific rows. Here is a link to the public dataset, which can be downloaded for example purposes.

Here is the code that I used:

library(readxl)
library(dplyr)
library(tidyverse)

# Set the working directory (setwd() expects a folder, not a file)
setwd('path\\to\\folder')


data <- read_excel("dataset_for_stack_question.xlsx")

unique(data$country_of_interest)

Using the values returned by unique(), I tried to remove all the regions that I didn't want with filter():

filtered <- data %>% 
  filter(country_of_interest != c("Middle Africa", 
                                  "Eastern Africa",
                                  "Western Africa",
                                  "Northern Africa",
                                  "Western Asia",
                                  "Central Asia",
                                  "Southern Asia",
                                  "Eastern Asia",
                                  "Central America",
                                  "Australia / New Zealand",
                                  "Eastern Europe",
                                  "Northern Europe",
                                  "Southern Europe",
                                  "Western Europe",
                                  "More developed regions",
                                  "Less developed regions",
                                  "Least developed countries",
                                  "Less developed regions, excluding least developed countries",
                                  "High-income countries",
                                  "Middle-income countries",
                                  "Upper-middle-income countries",
                                  "Lower-middle-income countries",
                                  "Low-income countries",
                                  "No income group available",
                                  "Africa",
                                  "Asia",
                                  "Europe",
                                  "Latin America and the Caribbean",
                                  "Northern America",
                                  "Oceania"))

However, when I run head(filtered, 20), I see that some rows that should have been removed are still there:

#Output of the head(filtered, 20) command:

# A tibble: 20 x 4
   Year  country_of_interest country             migration
   <chr> <chr>               <chr>                   <dbl>
 1 1990  Eastern Africa      Afghanistan                 0
 2 1990  Eastern Africa      American Samoa              0
 3 1990  Eastern Africa      Andorra                     0
 4 1990  Eastern Africa      Angola                 139108
 5 1990  Eastern Africa      Anguilla                    0
 6 1990  Eastern Africa      Antigua and Barbuda         0
 7 1990  Eastern Africa      Argentina                   0
 8 1990  Eastern Africa      Armenia                     0
 9 1990  Eastern Africa      Aruba                       0
10 1990  Eastern Africa      Australia                 148
11 1990  Eastern Africa      Austria                     0
12 1990  Eastern Africa      Azerbaijan                  0
13 1990  Eastern Africa      Bahamas                     0
14 1990  Eastern Africa      Bahrain                     0
15 1990  Eastern Africa      Bangladesh                131
16 1990  Eastern Africa      Barbados                    0
17 1990  Eastern Africa      Belarus                     0
18 1990  Eastern Africa      Belgium                   794
19 1990  Eastern Africa      Belize                      0
20 1990  Eastern Africa      Benin                       0

According to the filter code above, "Eastern Africa" should have been removed. Other values from the list also remain in the result. How can I ensure that every row matching any of the listed values is removed?

Upvotes: 0

Views: 85

Answers (1)

sneeze_shiny

Reputation: 328

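As stated by @akrun, the solution was to use !country_of_interest %in% c(...) instead of !=. The != operator compares the column with the vector element by element, recycling the vector along the rows, so each row is tested against only one of the listed values. %in% tests each row against the whole vector. The corrected filter is below: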

filtered <- data %>% 
  filter(!country_of_interest %in% c("Middle Africa", 
                                  "Eastern Africa",
                                  "Western Africa",
                                  "Northern Africa",
                                  "Western Asia",
                                  "Central Asia",
                                  "Southern Asia",
                                  "Eastern Asia",
                                  "Central America",
                                  "Australia / New Zealand",
                                  "Eastern Europe",
                                  "Northern Europe",
                                  "Southern Europe",
                                  "Western Europe",
                                  "More developed regions",
                                  "Less developed regions",
                                  "Least developed countries",
                                  "Less developed regions, excluding least developed countries",
                                  "High-income countries",
                                  "Middle-income countries",
                                  "Upper-middle-income countries",
                                  "Lower-middle-income countries",
                                  "Low-income countries",
                                  "No income group available",
                                  "Africa",
                                  "Asia",
                                  "Europe",
                                  "Latin America and the Caribbean",
                                  "Northern America",
                                  "Oceania"))
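
To see why the != version let rows through, here is a minimal sketch on a made-up six-row tibble. The names toy and drop_these and the values in them are hypothetical and only stand in for the real dataset and the long list of regions.

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset
toy <- tibble(
  country_of_interest = c("Africa", "Asia", "Kenya", "Africa", "Asia", "Japan"),
  migration = c(1, 2, 3, 4, 5, 6)
)

# Hypothetical exclusion list (the real one has many more values)
drop_these <- c("Africa", "Asia")

# != recycles drop_these along the column, so row 1 is compared with "Africa",
# row 2 with "Asia", row 3 with "Africa", row 4 with "Asia", and so on.
# A row is dropped only if it equals the value at its own recycled position.
with_neq <- toy %>% filter(country_of_interest != drop_these)

# %in% checks each row against every value in drop_these.
with_in <- toy %>% filter(!country_of_interest %in% drop_these)

print(with_neq$country_of_interest)  # "Kenya" "Africa" "Asia" "Japan"
print(with_in$country_of_interest)   # "Kenya" "Japan"
```

In the toy data, the second "Africa" and "Asia" rows survive the != filter because they happen to line up with the other element of the recycled vector. The same thing likely explains the "Eastern Africa" rows in the question's output.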

Upvotes: 1
