rg255

Reputation: 4169

String split in R with complex divisions

I have a data frame (day.df) with a column vial that I am trying to split into four new columns: treatment, gender, line and block. The day.df data frame also has the columns response and explanatory, which will be retained.

So day.df currently looks like this (top 4 of 31000 rows):

    vial    response explanatory
    Xm1.1   0        4
    Xm2.1   0        4
    Xm3.1   0        4
    Xm4.1   0        4
    .       .        .
    .       .        .        
    .       .        .

Each entry in the vial column currently looks like this: Xm1.2.

As such the new day.df will look something like this (I use four "random" rows to illustrate the variation within each new column):

        vial    response explanatory  treatment gender line  block
        Xm1.1   0        4            X         m      1     1
        Am1.1   0        4            A         m      1     1
        Xf3.2   0        4            X         f      3     2
        Xm4.2   0        4            X         m      4     2
        .       .        .
        .       .        .        
        .       .        .

I've taken a look around online for how to do this and this is the closest I came; I tried to split the vial column like this...

 > a=strsplit(day.df$vial,"")
 > a[1] "Xm1.2"

but I had problems when the "line" section of the string went above 9, because it then contains two characters, e.g. for the row where vial is Af20.2:

 > a[300]
 [[1]]
 [1] "A" "f" "2" "0" "." "2"

Should read as:

 > a[300]
 [[1]]
 [1] "A" "f" "20" "." "2"



So the steps I need help solving are:

  1. Overcome the problem with the line section of the string when over 9.
  2. Add the list of the split string to the day.df dataframe in the four required columns

Upvotes: 7

Views: 338

Answers (3)

G. Grothendieck

Reputation: 269526

Read the data:

Lines <- "vial    response explanatory
    Xm1.1   0        4
    Xm2.1   0        4
    Xm3.1   0        4
    Xm4.1   0        4
"

day.df <- read.table(text = Lines, header = TRUE, as.is = TRUE)

1) Then process it using strapplyc. (We used as.is=TRUE so that day.df$vial is character, but if it's a factor in your data frame then replace day.df$vial with as.character(day.df$vial).) This approach does the parsing in just one short line of code:

library(gsubfn)    
s <- strapplyc(day.df$vial, "(.)(.)(\\d+)[.](.)", simplify = rbind)

# we can now cbind it to the original data frame
colnames(s) <- c("treatment", "gender", "line", "block")
cbind(day.df, s)

which gives:

  vial response explanatory treatment gender line block
1 Xm1.1        0           4         X      m    1     1
2 Xm2.1        0           4         X      m    2     1
3 Xm3.1        0           4         X      m    3     1
4 Xm4.1        0           4         X      m    4     1
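
If installing gsubfn is not an option, the same capture-group idea can be sketched in base R with regexec() and regmatches(). This is only an illustration on a hypothetical sample vector (it includes Af20.2 from the question), and it assumes every vial matches the pattern; note the block group here accepts more than one digit, which goes beyond what the question shows.

```r
# Hypothetical sample; Af20.2 is the two-digit case from the question
vial <- c("Xm1.1", "Af20.2", "Xf3.2")

# regexec() reports the full match plus one entry per capture group
m <- regmatches(vial, regexec("^(.)(.)([0-9]+)[.]([0-9]+)$", vial))

# Stack the matches row-wise and drop the first column (the full match)
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("treatment", "gender", "line", "block")
parts
```

The result is a character matrix, so it can be cbind'ed to the data frame in the same way as s above.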

2) Here is a different approach. This does not use any packages and is relatively simple (no regular expressions at all) and only involves one R statement including the cbind'ing:

transform(day.df,
 treatment = substring(vial, 1, 1),        # 1st char
 gender = substring(vial, 2, 2),           # 2nd char
 line = substring(vial, 3, nchar(vial)-2), # 3rd through 2 prior to last char
 block = substring(vial, nchar(vial)))     # last char

The result is as before.
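
As a quick check of the two-digit case raised in the question, here is a small sketch of the same transform() call on a hypothetical two-row data frame (check.df is an illustrative name, not from the question). It additionally converts line and block to integers, and it relies on the assumption that block is always a single character.

```r
# Hypothetical sample including the two-digit line from the question (Af20.2)
check.df <- data.frame(vial = c("Xm1.1", "Af20.2"),
                       response = 0, explanatory = 4,
                       stringsAsFactors = FALSE)

out <- transform(check.df,
  treatment = substring(vial, 1, 1),                          # 1st char
  gender    = substring(vial, 2, 2),                          # 2nd char
  line      = as.integer(substring(vial, 3, nchar(vial) - 2)), # digits before the dot
  block     = as.integer(substring(vial, nchar(vial))))       # last char (assumed single digit)
out
```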

UPDATE: Added second approach.

UPDATE: Some simplifications.

Upvotes: 4

agstudy

Reputation: 121568

Using gsub and strsplit like this:

v <- c('Xm1.1','Xf3.2')
h <- gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])','\\1|\\2|\\3|\\4',v)
do.call(rbind,strsplit(h,'[|]'))

    [,1] [,2] [,3] [,4]
[1,] "X"  "m"  "1"  "1" 
[2,] "X"  "f"  "3"  "2" 

The result is a character matrix; you can cbind it to your original data.frame.

EDIT @GriffinEvo Applied & tested code:

 a = gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])',
           '\\1|\\2|\\3|\\4',day.df$vial) 

 do.call(rbind, strsplit(a,'[|]') )
 day.df = cbind(day.df,do.call(rbind,strsplit(a,'[|]'))) 
 colnames(day.df)[4:7] = c ("treatment" , "gender" , "line" , "block")
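
One thing to watch with this approach: gsub returns any string that does not match the pattern unchanged, so a malformed vial would not split into four pieces. A minimal sketch of a guard, using a hypothetical vector whose last entry is deliberately invalid, and anchoring the pattern with ^ and $ (an addition to the code above):

```r
# Hypothetical sample: the last entry deliberately breaks the expected format
v <- c("Xm1.1", "Af20.2", "Zq5.9")
pattern <- "^(X|A)(m|f)([0-9]{1,2})[.]([1-4])$"

ok  <- grepl(pattern, v)   # TRUE where the whole string matches
bad <- v[!ok]              # entries that need manual inspection

# Split only the valid entries, then build typed columns
h <- gsub(pattern, "\\1|\\2|\\3|\\4", v[ok])
parts <- as.data.frame(do.call(rbind, strsplit(h, "[|]")),
                       stringsAsFactors = FALSE)
names(parts) <- c("treatment", "gender", "line", "block")
parts$line  <- as.integer(parts$line)    # stored as text until converted
parts$block <- as.integer(parts$block)
```

Converting line and block to integers matters if they are later sorted or used numerically, since as text "10" sorts before "2".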

Upvotes: 8

Maxim.K

Reputation: 4180

An alternative way that does not require the use of regular expressions is to use substr() in combination with the fact that the last part of your code is a numeric value.

Let's say your data is this:

d1 <- read.table(header=TRUE,text="
    vial    response explanatory
    Xm1.1   0        4
    Xm2.1   0        4
    Xm3.2   0        4
    Xm44.1   0        4")

Then the result can be achieved by:

d1$line <- as.integer(substr(x=d1$vial,3,6))
d1$block <- round((as.numeric(substr(x=d1$vial,3,6)) %% 1)*10)  # round() guards against floating-point error in the fractional part
d1$treatment <- substr(x=d1$vial,1,1)
d1$gender <- substr(x=d1$vial,2,2)

The numeric part begins always after exactly two symbols, regardless of the number of digits. We extract that part, and take digits before the decimal in the first line, and digits after the decimal in the second line. Extracting treatment and gender is straightforward.
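
Treating the suffix as a decimal number depends on floating-point arithmetic and cannot tell block 1 from block 10. A variation that keeps the "no regular expressions" idea but splits on the literal dot instead might look like the sketch below; the vector is hypothetical, and Af20.10 is included only to illustrate a two-digit block, which the question's data does not show.

```r
# Hypothetical sample; Af20.10 illustrates a two-digit block
vial <- c("Xm1.1", "Xm44.1", "Af20.10")

num    <- substring(vial, 3)                 # everything after the two letters
pieces <- strsplit(num, ".", fixed = TRUE)   # split on a literal dot, no regex

line  <- as.integer(sapply(pieces, `[`, 1))  # part before the dot
block <- as.integer(sapply(pieces, `[`, 2))  # part after the dot
```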

Upvotes: 1
