Reputation: 1
I have a dataset:
x y
A1 '*****'
A2 '123,0'
A3 '4557777'
A4 '8756784321675'
A5 'ЯРН'
A6 ''
A7
A8
I want to remove all rows except those whose values in column y consist only of numbers or are empty(not '', but NA). So the desired output is this:
x y
A3 '4557777'
A4 '8756784321675'
A7
A8
How could I do that? As you see there is a value with cyrillic letters 'ЯРН'. Thats the most unclear part. With comma i tried this:
df[!grepl(",", df$y),]
But what to do with others
Upvotes: 0
Views: 42
Reputation: 887901
Using dplyr
library(dplyr)
library(stringr)
df %>%
filter(str_detect(y, '^[0-9]+$')|is.na(y))
Upvotes: 0
Reputation: 24878
Here is an approach with base R:
You use the pattern "'\\d+'"
which looks for a string that starts with '
, has one or more digits and then ends with '
.
df[grepl("'\\d+'",df$y) | is.na(df$y),]
x y
3 A3 '4557777'
4 A4 '8756784321675'
7 A7 <NA>
8 A8 <NA>
Upvotes: 1
Reputation: 79338
since your data seems to contain '
you could do:
subset(df, grepl("^'?\\d+'?$",y) | is.na(y))
Upvotes: 1
Reputation: 76653
The following solution works even with cyrillic symbols, it takes advantage of the fact that to coerce a string that is not all numbers produces NA
's.
tmp <- gsub("'", "", df1$y)
i <- !is.na(suppressWarnings(as.numeric(tmp)))
j <- df1$y == "" | is.na(df1$y)
df1[i | j, ]
# x y
#3 A3 '4557777'
#4 A4 '8756784321675'
#7 A7
#8 A8
Data
df1 <- read.table(text = "
x y
A1 '*****'
A2 '123,0'
A3 '4557777'
A4 '8756784321675'
A5 'ЯРН'
A6 ''
A7
A8
", header = TRUE, fill = TRUE, quote = "\"")
Upvotes: 1