Reputation: 321

Split a column in R with inconstant data format

I have an R dataframe that has 17 columns. One column contains the unique identifiers that I will use to merge with other dataframes. However, some rows in this column contains extra data which makes merging not possible. Here is a subset of different types of data I'm looking at.

M2017013708-MN-M02199-180405
M201701492756-MN-M05144-180419
M2016019446_S3_L001
M2016019762

All data after the -MN is considered extra data that needs to be removed. My goal is to add a new column to the dataframe without the extra data. It would look like this:

M2017013708
M201701492756
M2016019446_S3_L001
M2016019762

I've tried to split the data at -MN which makes a list then make it into a dataframe with ldply. However, this results in an error because the split causes a list of multiple lengths since not all rows have a -MN.

split_my_data <- strsplit(my_data$sample_name, '-MN')
df <- ldply(split_my_data)

I tried using using a case and regualr expression with sql with sqldf. However, I get an error of no such function REGEXP.

Any help would be greatly appreciated.

Upvotes: 0

Answers (3)

G. Grothendieck

Reputation: 269491

SQLite

Regarding SQLite, regular expressions are only available if regular expression support is turned on when SQLite is built but RSQLite did not do that so it is not available.

What you can do is append -MN- onto the end of each string to ensure that there is always at least one occurrence and then search for it using instr and take the substring to that point using substr:

library(sqldf)
sqldf("select V1, substr(V1, 1, instr(V1 || '-MN-', '-MN-') - 1) as V2 from DF")

giving:

                              V1                  V2
1   M2017013708-MN-M02199-180405         M2017013708
2 M201701492756-MN-M05144-180419       M201701492756
3            M2016019446_S3_L001 M2016019446_S3_L001
4                    M2016019762         M2016019762

H2

If we use the H2 backend to sqldf instead of SQLite then we can use regular expressions. The RH2 package includes both the R driver and H2 itself and if it is loaded sqldf will assume you wanted to use it instead of SQLite. The order in which RH2 and sqldf are loaded does not matter.

library(RH2)
library(sqldf)

sqldf("select V1, regexp_replace(V1, '-MN-.*', '') as V2 from DF")

Note

The input in reproducible form is:

DF <- data.frame(V1 = c("M2017013708-MN-M02199-180405",
                        "M201701492756-MN-M05144-180419",
                        "M2016019446_S3_L001",
                        "M2016019762"))

Upvotes: 1

rjen

Reputation: 1972

A simple tidy solution could also be:

library(dplyr)
library(stringr)

data <- tibble(dirty = c('M2017013708-MN-M02199-180405',
                         'M201701492756-MN-M05144-180419',
                         'M2016019446_S3_L001',
                         'M2016019762'))

data %>%
  mutate(clean = str_remove(dirty, pattern = '-MN.*'))

# A tibble: 4 x 2
  dirty                          clean              
  <chr>                          <chr>              
1 M2017013708-MN-M02199-180405   M2017013708        
2 M201701492756-MN-M05144-180419 M201701492756      
3 M2016019446_S3_L001            M2016019446_S3_L001
4 M2016019762                    M2016019762

Upvotes: 1

Mike V

Reputation: 1364

Or you can try this method using Look Beind Regex (?<=)

df <- data.frame(OBS = 1:4, 
                 CODE = c("M2017013708-MN-M02199-180405",
                             "M201701492756-MN-M05144-180419",
                             "M2016019446_S3_L001",
                             "M2016019762"))
df2 <- df %>% 
  mutate(CODE2 = str_replace_all(CODE, regex("(?<=)-MN.*"), ""))
# OBS                           CODE               CODE2
# 1   1   M2017013708-MN-M02199-180405         M2017013708
# 2   2 M201701492756-MN-M05144-180419       M201701492756
# 3   3            M2016019446_S3_L001 M2016019446_S3_L001
# 4   4                    M2016019762         M2016019762

Upvotes: 1

Split a column in R with inconstant data format

Answers (3)

SQLite

H2

Note

Related Questions