I have a df (day.df) with the column vial, which I am trying to split into four new columns. The new columns will be treatment, gender, line and block. The day.df dataframe also has the columns response & explanatory, which will be retained.
So day.df currently looks like this (top 4 of 31000 rows):
vial response explanatory
Xm1.1 0 4
Xm2.1 0 4
Xm3.1 0 4
Xm4.1 0 4
. . .
. . .
. . .
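The current contents of the vial column look like this: Xm1.2.
The first character (X) can be X or A - this will be the treatment.
The second character (m) can be m or f - this is the gender.
The number before the dot (1) ranges from 1-40 - this is the line.
The number after the dot (2) is the block and ranges from 1-4.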
As such the new day.df will look something like this (I use four "random" rows to illustrate the variation within each new column):
vial response explanatory treatment gender line block
Xm1.1 0 4 X m 1 1
Am1.1 0 4 A m 1 1
Xf3.2 0 4 X f 3 2
Xm4.2 0 4 X m 4 2
. . .
. . .
. . .
I've taken a look around online for how to do this, and this is the closest I came. I tried to split the vial column like this:
> a=strsplit(day.df$vial,"")
> a[1] "Xm1.2"
but had problems when the "line" section of the string went above 9, because it then contains two characters, e.g. (for the row where vial is Af20.2):
> a[300]
[[1]]
[1] "A" "f" "2" "0" "." "2"
Should read as:
> a[300]
[[1]]
[1] "A" "f" "20" "." "2"
So the steps I need help solving are:
1) Dealing with the line section of the string when over 9.
2) Putting the result into the day.df dataframe in the four required columns.
Upvotes: 7
Reputation: 269526
Read the data:
Lines <- "vial response explanatory
Xm1.1 0 4
Xm2.1 0 4
Xm3.1 0 4
Xm4.1 0 4
"
day.df <- read.table(text = Lines, header = TRUE, as.is = TRUE)
1) Then process it using strapplyc. (We used as.is = TRUE so that day.df$vial is character, but if it's a factor in your data frame then replace day.df$vial with as.character(day.df$vial).) This approach does the parsing in just one short line of code:
library(gsubfn)
s <- strapplyc(day.df$vial, "(.)(.)(\\d+)[.](.)", simplify = rbind)
# we can now cbind it to the original data frame
colnames(s) <- c("treatment", "gender", "line", "block")
cbind(day.df, s)
which gives:
vial response explanatory treatment gender line block
1 Xm1.1 0 4 X m 1 1
2 Xm2.1 0 4 X m 2 1
3 Xm3.1 0 4 X m 3 1
4 Xm4.1 0 4 X m 4 1
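If installing gsubfn is not an option, roughly the same capture-group idea can be sketched in base R with regexec() and regmatches(). This is only an illustration, not part of the approach above; the sample vector (including the two-digit Af20.2 case from the question) and the names m and parts are made up for the example.

```r
# Hypothetical sample, including a two-digit line (Af20.2)
vial <- c("Xm1.1", "Af20.2", "Xf3.2")

# regexec() records the full match plus one entry per capture group;
# regmatches() extracts the corresponding substrings
m <- regmatches(vial, regexec("^(.)(.)([0-9]+)[.](.)$", vial))

# Drop the full match (first element) and stack the groups row-wise
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("treatment", "gender", "line", "block")
parts
```

As with the strapplyc result, parts is a character matrix that can be cbind'ed to the data frame.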
2) Here is a different approach. This does not use any packages and is relatively simple (no regular expressions at all) and only involves one R statement including the cbind'ing:
transform(day.df,
  treatment = substring(vial, 1, 1),               # 1st char
  gender    = substring(vial, 2, 2),               # 2nd char
  line      = substring(vial, 3, nchar(vial) - 2), # 3rd through 2 prior to last char
  block     = substring(vial, nchar(vial)))        # last char
The result is as before.
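As a quick check on the case that caused trouble in the question, here is the same statement applied to a row with a two-digit line. The data frame chk is a made-up example, and it assumes block is always a single character, as stated in the question.

```r
# Hypothetical rows, one of them with a two-digit line
chk <- data.frame(vial = c("Xm1.1", "Af20.2"), stringsAsFactors = FALSE)

chk <- transform(chk,
  treatment = substring(vial, 1, 1),
  gender    = substring(vial, 2, 2),
  line      = substring(vial, 3, nchar(vial) - 2),  # everything between char 2 and the dot
  block     = substring(vial, nchar(vial)))

as.character(chk$line)   # the second element keeps both digits: "20"
```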
UPDATE: Added second approach.
UPDATE: Some simplifications.
Upvotes: 4
Reputation: 121568
Using gsub and strsplit like this:
v <- c('Xm1.1','Xf3.2')
h <- gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])','\\1|\\2|\\3|\\4',v)
do.call(rbind,strsplit(h,'[|]'))
[,1] [,2] [,3] [,4]
[1,] "X" "m" "1" "1"
[2,] "X" "f" "3" "2"
The result is a character matrix; you can cbind it to your original data.frame.
EDIT @GriffinEvo Applied & tested code:
a = gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])',
'\\1|\\2|\\3|\\4',day.df$vial)
do.call(rbind, strsplit(a,'[|]') )
day.df = cbind(day.df,do.call(rbind,strsplit(a,'[|]')))
colnames(day.df)[4:7] = c ("treatment" , "gender" , "line" , "block")
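Note that the columns added this way hold text, so line and block are not numbers yet. If typed columns are wanted, one possible sketch uses strcapture() from the utils package (available in R 3.4.0 or later, as far as I know), which takes the same kind of pattern plus a prototype data frame. The sample vector and the names proto and parts are assumptions for the illustration.

```r
# Hypothetical sample; in the question this would be day.df$vial
vial <- c("Xm1.1", "Af20.2")

# The prototype fixes the name and type of each captured group
proto <- data.frame(treatment = character(), gender = character(),
                    line = integer(), block = integer(),
                    stringsAsFactors = FALSE)

# One row per input string, one typed column per capture group
parts <- strcapture("(X|A)(m|f)([0-9]{1,2})[.]([1-4])", vial, proto)
str(parts)
```

The resulting data frame could then be joined on with cbind(day.df, parts) in the same way as above.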
Upvotes: 8
Reputation: 4180
An alternative way that does not require the use of regular expressions is to use substr() in combination with the fact that the last part of your code is a numeric value.
Let's say your data is this:
d1 <- read.table(header=TRUE,text="
vial response explanatory
Xm1.1 0 4
Xm2.1 0 4
Xm3.2 0 4
Xm44.1 0 4")
Then the result can be achieved by:
d1$line <- as.integer(substr(x=d1$vial,3,6))
d1$block <- round((as.numeric(substr(x=d1$vial,3,6)) %% 1)*10)  # round() removes floating-point error (1.1 %% 1 is not exactly 0.1)
d1$treatment <- substr(x=d1$vial,1,1)
d1$gender <- substr(x=d1$vial,2,2)
The numeric part always begins after exactly two symbols, regardless of the number of digits. We extract that part, taking the digits before the decimal point in the first line and the digits after it in the second line. Extracting treatment and gender is straightforward.
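The modulo step goes through floating-point arithmetic. A variant that stays with strings and integers throughout, still without regular expressions, could be sketched as follows. The sample vector and the names num and bits are made up for the example, and it assumes every vial contains exactly one dot.

```r
# Hypothetical sample, including a two-digit line
vial <- c("Xm1.1", "Xm44.1")

# Everything after the first two symbols, e.g. "1.1" or "44.1"
num <- substr(vial, 3, nchar(vial))

# fixed = TRUE treats "." as a literal dot rather than a regex wildcard
bits <- do.call(rbind, strsplit(num, ".", fixed = TRUE))

line  <- as.integer(bits[, 1])   # digits before the dot
block <- as.integer(bits[, 2])   # digits after the dot
```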
Upvotes: 1