Reputation: 321
I have an R dataframe that has 17 columns. One column contains the unique identifiers that I will use to merge with other dataframes. However, some rows in this column contains extra data which makes merging not possible. Here is a subset of different types of data I'm looking at.
M2017013708-MN-M02199-180405
M201701492756-MN-M05144-180419
M2016019446_S3_L001
M2016019762
All data after the -MN is considered extra data that needs to be removed. My goal is to add a new column to the dataframe without the extra data. It would look like this:
M2017013708
M201701492756
M2016019446_S3_L001
M2016019762
I've tried to split the data at -MN which makes a list then make it into a dataframe with ldply. However, this results in an error because the split causes a list of multiple lengths since not all rows have a -MN.
split_my_data <- strsplit(my_data$sample_name, '-MN')
df <- ldply(split_my_data)
I tried using using a case and regualr expression with sql with sqldf. However, I get an error of no such function REGEXP.
Any help would be greatly appreciated.
Upvotes: 0
Views: 59
Reputation: 269491
Regarding SQLite, regular expressions are only available if regular expression support is turned on when SQLite is built but RSQLite did not do that so it is not available.
What you can do is append -MN-
onto the end of each string to ensure that there is always at least one occurrence and then search for it using instr
and take the substring to that point using substr
:
library(sqldf)
sqldf("select V1, substr(V1, 1, instr(V1 || '-MN-', '-MN-') - 1) as V2 from DF")
giving:
V1 V2
1 M2017013708-MN-M02199-180405 M2017013708
2 M201701492756-MN-M05144-180419 M201701492756
3 M2016019446_S3_L001 M2016019446_S3_L001
4 M2016019762 M2016019762
If we use the H2 backend to sqldf instead of SQLite then we can use regular expressions. The RH2 package includes both the R driver and H2 itself and if it is loaded sqldf will assume you wanted to use it instead of SQLite. The order in which RH2 and sqldf are loaded does not matter.
library(RH2)
library(sqldf)
sqldf("select V1, regexp_replace(V1, '-MN-.*', '') as V2 from DF")
The input in reproducible form is:
DF <- data.frame(V1 = c("M2017013708-MN-M02199-180405",
"M201701492756-MN-M05144-180419",
"M2016019446_S3_L001",
"M2016019762"))
Upvotes: 1
Reputation: 1972
A simple tidy solution could also be:
library(dplyr)
library(stringr)
data <- tibble(dirty = c('M2017013708-MN-M02199-180405',
'M201701492756-MN-M05144-180419',
'M2016019446_S3_L001',
'M2016019762'))
data %>%
mutate(clean = str_remove(dirty, pattern = '-MN.*'))
# A tibble: 4 x 2
dirty clean
<chr> <chr>
1 M2017013708-MN-M02199-180405 M2017013708
2 M201701492756-MN-M05144-180419 M201701492756
3 M2016019446_S3_L001 M2016019446_S3_L001
4 M2016019762 M2016019762
Upvotes: 1
Reputation: 1364
Or you can try this method using Look Beind Regex (?<=)
df <- data.frame(OBS = 1:4,
CODE = c("M2017013708-MN-M02199-180405",
"M201701492756-MN-M05144-180419",
"M2016019446_S3_L001",
"M2016019762"))
df2 <- df %>%
mutate(CODE2 = str_replace_all(CODE, regex("(?<=)-MN.*"), ""))
# OBS CODE CODE2
# 1 1 M2017013708-MN-M02199-180405 M2017013708
# 2 2 M201701492756-MN-M05144-180419 M201701492756
# 3 3 M2016019446_S3_L001 M2016019446_S3_L001
# 4 4 M2016019762 M2016019762
Upvotes: 1