littleworth

Reputation: 5169

How to exclude rows based on grepl special character using dplyr piping

I have the following data frame:

library(tidyverse)
ndf <- structure(list(experiment_status = c("Negative？", "Negative？", 
"Negative", "Negative？", "Negative？", "Negative？"), id = 1:6), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L))

ndf
#> # A tibble: 6 x 2
#>   experiment_status    id
#>   <chr>             <int>
#> 1 Negative？            1
#> 2 Negative？            2
#> 3 Negative              3
#> 4 Negative？            4
#> 5 Negative？            5
#> 6 Negative？            6

What I want to do is keep only the rows without a question mark, i.e. only row 3 should remain after the pipe.

Why did this fail?

  ndf %>% 
    filter(!grepl("[?]", experiment_status))

What's the right way to do it?

Upvotes: 0

Views: 908

Answers (4)

moodymudskipper

Reputation: 47300

To clean up the question marks you can use stringi::stri_trans_general(). I'd suggest applying it as early as possible to your data to avoid nasty surprises.

library(stringi)
ndf %>%
  mutate_at("experiment_status", stri_trans_general, "latin-ascii") %>%
  filter(!grepl("[?]", experiment_status)) # or filter(!grepl("\\?$", experiment_status))

# A tibble: 1 x 2
#     experiment_status    id
#                 <chr> <int>
#   1          Negative     3

This way you don't need to know which character is causing the problem, and the same call also cleans up other unusual punctuation marks or alternate characters.

Upvotes: 1

Onyambu

Reputation: 79208

 ndf %>% 
     filter(!grepl(intToUtf8(65311), experiment_status))
# A tibble: 1 x 2
  experiment_status    id
  <chr>             <int>
1 Negative              3

Note also that if you coerce the tibble to a data frame, the character is printed as its hex Unicode code point, <U+FF1F>. You can use that value to filter as well, i.e.:

ndf %>% 
     filter(!grepl(intToUtf8(0xFF1F), experiment_status))
# A tibble: 1 x 2
  experiment_status    id
  <chr>             <int>
1 Negative              3
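
If you don't know the code point up front, a minimal base R sketch like the one below can help you find it. The vector `status` is only a stand-in for `ndf$experiment_status`, and the variable names are made up for illustration.

```r
# Stand-in for ndf$experiment_status; "\uff1f" is the fullwidth question mark
status <- c("Negative\uff1f", "Negative")

# Take the last character of each string and look up its code point
last_char <- substring(status, nchar(status))
codes <- vapply(last_char, utf8ToInt, integer(1), USE.NAMES = FALSE)
codes

# An ASCII "?" is code point 63, so a literal ASCII match finds nothing
ascii_hits <- grepl("?", status, fixed = TRUE)

# Matching the fullwidth character itself does find the first element
fullwidth_hits <- grepl("\uff1f", status, fixed = TRUE)
```

Here `codes` should contain 65311 (0xFF1F) for the first element, which is the value passed to intToUtf8() above.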

Upvotes: 2

A. Suliman

Reputation: 13125

There was probably a problem when importing the CSV file, which may have been written on a non-English OS.

> '？' == '?'
[1] FALSE

ndf %>% filter(!grepl('？', experiment_status))

# Trimming white space does not help
> trimws(ndf$experiment_status, 'both')
[1] "Negative？" "Negative？" "Negative"   "Negative？" "Negative？" "Negative？"
# Change '？' to '?' using gsub
> gsub('？', '?', ndf$experiment_status)
[1] "Negative?" "Negative?" "Negative"  "Negative?" "Negative?" "Negative?"


ndf %>% mutate(experiment_status_clean = gsub('？', '?', experiment_status))

# Now you are searching for a literal ?, so you need to escape it using \\
ndf %>% mutate(experiment_status_clean = gsub('？', '?', experiment_status)) %>% 
        filter(!grepl('\\?', experiment_status_clean))
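
If more than one fullwidth punctuation mark might have crept in, a rough base R variant of the same idea is to translate them all in one pass with chartr(). This is only a sketch: the data frame `df`, the extra "Positive" row, and the choice of which characters to map are assumptions for illustration, not part of the original data.

```r
# Hypothetical stand-in for ndf; "\uff1f" and "\uff01" are the fullwidth
# question mark and exclamation mark
df <- data.frame(
  experiment_status = c("Negative\uff1f", "Negative", "Positive\uff01"),
  id = 1:3,
  stringsAsFactors = FALSE
)

# chartr() maps each character of the first string to the character at the
# same position in the second string
df$experiment_status_clean <- chartr("\uff1f\uff01", "?!", df$experiment_status)

# fixed = TRUE matches "?" literally, so no escaping is needed here
kept <- df[!grepl("?", df$experiment_status_clean, fixed = TRUE), ]
kept$id
```

After the translation an ordinary ASCII pattern works, so the rest of the pipeline doesn't need to know about the fullwidth characters.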

Upvotes: 1

neilfws

Reputation: 33782

ndf %>% 
  filter(!grepl("?", experiment_status, fixed = TRUE))

But in your example I think filter(experiment_status == "Negative") would work too.

EDIT: or since we can have "Positive" too -

ndf %>% 
  filter(experiment_status %in% c("Negative", "Positive"))

Upvotes: 1
