Reputation: 11
I have a large data set with one column that contains both characters (e.g. "A", "B", etc.) and numbers, but the numbers are read in as characters as well. I want to remove all rows where the value in this column is a number. For simplicity, I will show a mock vector that represents the column.
For example,
data <- c("A", "A", "B", "B", "1", "2", "-2")
I inherited this data, and the data set is large. Is there a good way to drop the cells containing the numbers 1, 2, -2, which are read in as characters?
Thanks for the help.
Upvotes: 0
Views: 398
Reputation: 70326
A simple option would be:
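# as.character() guards against a factor column, where as.numeric() would return the level codes
data <- droplevels(data[is.na(suppressWarnings(as.numeric(as.character(data$col)))), ])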
Convert the column (col) to numeric and keep the rows whose values became NA (which means they are not numbers). Then drop the factor levels that are no longer in use.
Some example usages:
v1 <- c('A12', 'AB12', '-2.53', '25.29', 'BCd')
v1[is.na(suppressWarnings(as.numeric(v1)))]
#[1] "A12" "AB12" "BCd"
Or with special characters:
v1 <- c('A_12', 'AB12', '-2.53', '25.29', 'B-Cd')
v1[is.na(suppressWarnings(as.numeric(v1)))]
#[1] "A_12" "AB12" "B-Cd"
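The same conversion approach can be sketched on a whole data frame rather than a vector. This is only an illustration under assumptions: the data frame `df` and the columns `col` and `value` are hypothetical names, and the column is deliberately stored as a factor to show why the `as.character()` step matters in that case.

```r
# Hypothetical data frame; `df`, `col` and `value` are illustrative names.
df <- data.frame(col = c("A", "A", "B", "B", "1", "2", "-2"),
                 value = 1:7,
                 stringsAsFactors = TRUE)

# Convert factor -> character -> numeric. Calling as.numeric() directly on a
# factor returns the integer level codes, so nothing would become NA.
is_num <- !is.na(suppressWarnings(as.numeric(as.character(df$col))))

# Keep only the non-numeric rows and drop the levels that are no longer used.
kept <- droplevels(df[!is_num, ])
kept$col
levels(kept$col)
```

Note that as.numeric() also parses strings such as "1e5" or "0x1A" as numbers, so rows holding values like these would be dropped as well.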
Upvotes: 1
Reputation: 887711
One simple regex option is below. Here, I am subsetting the dataset using grepl by removing those elements that have numbers starting from beginning (^) to end ($) of the string.
subdat <- droplevels(data[!grepl('^[0-9.-]+$', data$yourCol),])
If the column is a factor, you can use droplevels to drop the levels, or use factor again to drop the "unused" levels. Then, check "yourCol" of "data" with levels(data$yourCol). Another option is to convert it to a "character" column with data$yourCol <- as.character(data$yourCol) and use unique(data$yourCol).
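A small sketch of the level handling described above, using a made-up factor `f` (not from the original data), to show that subsetting alone keeps the unused levels while droplevels, factor, or a character conversion removes them:

```r
# Hypothetical factor standing in for data$yourCol.
f <- factor(c("A", "B", "1", "2"))

# Subsetting removes the elements but keeps all original levels.
sub <- f[!grepl("^[0-9.-]+$", f)]
levels(sub)

# Either call leaves only the levels that are still in use.
levels(droplevels(sub))
levels(factor(sub))

# Alternative: work with a plain character vector instead of a factor.
unique(as.character(sub))
```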
Testing with some example data
v1 <- c('A12', 'AB12', '-2.53', '25.29', 'BCd', '-12AB5', '-AB125', '- ')
v1[!grepl('^[0-9.-]+$', v1)]
#[1] "A12" "AB12" "BCd" "-12AB5" "-AB125" "- "
Double-checking with @docendodiscimus's code
v1[is.na(suppressWarnings(as.numeric(v1)))]
#[1] "A12" "AB12" "BCd" "-12AB5" "-AB125" "- "
NOTE: I did update the regex after finding that the initial one may not work in some cases.
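One case the character class [0-9.-] does not cover: it also matches strings built only from those characters that are not valid numbers, such as "-", ".." or "1-2". If that matters for your data, a stricter pattern could be used. The pattern below is only a sketch of one possibility, not part of the answer above; it assumes plain decimal notation with an optional leading minus sign and no exponents.

```r
# Illustrative values, including a few that are not real numbers.
v <- c("A12", "-2.53", "25.29", "-", "1-2", "..", "7", ".5")

loose  <- "^[0-9.-]+$"
# Optional minus, then either digits with an optional fraction, or a
# fraction that starts with a dot (assumed format, no exponents).
strict <- "^-?([0-9]+\\.?[0-9]*|\\.[0-9]+)$"

v[!grepl(loose, v)]   # treats "-", "1-2" and ".." as numbers and drops them
v[!grepl(strict, v)]  # keeps them, since they are not valid decimals
```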
Upvotes: 0