french_fries
french_fries

Reputation: 1

Subset complicated dataset

I have a dataset:

x     y
A1  '*****'
A2  '123,0'
A3  '4557777'
A4  '8756784321675'
A5  'ЯРН'
A6  ''
A7  
A8

I want to remove all rows except those whose values in column y consist only of numbers or are empty(not '', but NA). So the desired output is this:

x     y
A3  '4557777'
A4  '8756784321675'
A7  
A8

How could I do that? As you see there is a value with cyrillic letters 'ЯРН'. Thats the most unclear part. With comma i tried this:

df[!grepl(",", df$y),]

But what to do with others

Upvotes: 0

Views: 42

Answers (5)

akrun
akrun

Reputation: 887901

Using dplyr

library(dplyr)
library(stringr)
df %>%
   filter(str_detect(y, '^[0-9]+$')|is.na(y))

Upvotes: 0

Ian Campbell
Ian Campbell

Reputation: 24878

Here is an approach with base R:

You use the pattern "'\\d+'" which looks for a string that starts with ', has one or more digits and then ends with '.

df[grepl("'\\d+'",df$y) | is.na(df$y),]
   x             y
3 A3       '4557777'
4 A4 '8756784321675'
7 A7           <NA>
8 A8           <NA>

Upvotes: 1

Onyambu
Onyambu

Reputation: 79338

since your data seems to contain ' you could do:

subset(df, grepl("^'?\\d+'?$",y) | is.na(y))

Upvotes: 1

BellmanEqn
BellmanEqn

Reputation: 799

suppressWarnings(df[!is.na(as.numeric(df$y)) | is.na(df$y), ])

Upvotes: 1

Rui Barradas
Rui Barradas

Reputation: 76653

The following solution works even with cyrillic symbols, it takes advantage of the fact that to coerce a string that is not all numbers produces NA's.

tmp <- gsub("'", "", df1$y)
i <- !is.na(suppressWarnings(as.numeric(tmp)))
j <- df1$y == "" | is.na(df1$y)
df1[i | j, ]
#   x               y
#3 A3       '4557777'
#4 A4 '8756784321675'
#7 A7                
#8 A8    

Data

df1 <- read.table(text = "
x     y
A1  '*****'
A2  '123,0'
A3  '4557777'
A4  '8756784321675'
A5  'ЯРН'
A6  ''
A7  
A8
", header = TRUE, fill = TRUE, quote = "\"")

Upvotes: 1

Related Questions