Reputation: 7113
say I have a molecular formula like "C5Cl2NO2S" for which I'd like to calculate in R the molecular mass. I though the easiest would be to use a regular expression, to analyze and split the formula into it's elemental components and hand those over to a separate function which performs the calculation. However, I'm facing the problem that, when I hand over the backreferences of my RegEx, these are not evaluated but handed over as "\\1", "\\2".
This is my attempt:
masses <- list(
C = 12,
H = 1.01,
Cl = 34.97,
N = 14.00,
O = 15.99,
P = 30.97,
S = 31.97
)
elementMass <- function( element, count ) {
if( count == "" ) {
count <- "1"
}
return( as.character( masses[[ element ]] * as.numeric( count ) ) )
}
sumFormula2Mass <- function( x ){
y <- 0.0
for( e in x ) {
if( e != "" ) {
y <- y + as.numeric( sub( "^(C|H|Cl|N|O|P|S)([0-9]*)$", elementMass("\\1", "\\2"), e ) )
}
}
return( y )
}
sub(
"^(C[0-9]*)?(H[0-9]*)?(Cl[0-9]*)?(N[0-9]*)?(O[0-9]*)?(P[0-9]*)?(S[0-9]*)?$",
sumFormula2Mass( c("\\1", "\\2", "\\3", "\\4", "\\5", "\\6", "\\7") ),
"C5Cl2NO2S"
)
Any ideas how to improve this? Many thanks
Upvotes: 3
Views: 648
Reputation: 14093
have a look at
RSiteSearch ("molecular weight")
I guess the 2nd or 3rd hit is what you are looking for (the first is for proteins).
(sorry, didn't see your comment on the other answer - however, I leave this in case someone is really looking for molecular weight calculation).
Upvotes: 0
Reputation: 269694
Below we assumed the form of formula in the question, i.e. a string of components each of which is an upper case letter followed optionally by lower case letters followed optionally by digits. We use gsubfn
in the gsubfn package. It is like gsub
except the replacement string can be various other objects. Here its a proto object. A proto object is an environment and here its used for containing a property, sum
, and two methods, pre
and fun
. At the beginning pre is automatically run and has the effect of initializing sum
. Then each time the regular expression matches, the proto object and the two referenced strings are passed to fun
and fun
is run to process them. At the end p$sum
contains the result. The variable masses
is defined in the question.
library(gsubfn)
p <- proto(pre = function(this) this$sum <- 0,
fun = function(this, name, count) {
count <- as.numeric(count)
if (is.na(count)) count <- 1
this$sum <- this$sum + masses[[name]] * count
""
})
gsubfn("([[:upper:]][[:lower:]]*)(\\d*)", p, "C5Cl2NO2S")
p$sum # 207.89
Upvotes: 5
Reputation: 60746
Believe it or not, this appears to not be all that uncommon:
http://www.sitepoint.com/forums/php-34/chemical-formula-regular-expressions-317012.html
I found the above by Googling molecular formula regex
Upvotes: 0
Reputation: 7023
I don't think that backreferences in sub()
work like this. You seem to be treating them as though they are returned values, when they are inputs.
Here is a different solution. It takes a very different approach, namely splitting the string up into individual parts and then referring to these. However, it has some limitations. Firstly, it assumes cannot handle chemical formulae with parentheses. Secondly, it assumes that atoms are written sensibly (i.e. chlorine is written as Cl - upper case C and lower case l). There are probably lots of other limitations, but this should give you an idea about how this solution might look.
sumFormula2Mass2 <- function(x,masses){
summedMasses <- NULL
for(e in x){
## split up the string
split.e <- unlist(strsplit(e,''))
## join letters from individual elements (since subequent letters should be lower case)
ilower <- grep('[a-z]',split.e)
if(length(ilower) > 0){
for(i in 1:length(ilower)){
j <- ilower[i]
split.e <- c(if(j > 2) split.e[1:(j-2)],
paste(split.e[(j-1):j],collapse=''),
if(j < length(split.e)) split.e[(j+1):length(split.e)])
ilower <- ilower - 1
}
}
## join numbers together (in case there are more than 10 atoms)
inum <- grep('[0-9]',split.e)
if(length(inum) > 1){
for(i in 1:(length(inum)-1)){
if(inum[i + 1] == inum[i] + 1){
j <- inum[i]
split.e <- c(split.e[1:(j-1)],
paste(split.e[j:(j+1)],collapse=''),
if(j+2 <= length(split.e)) split.e[(j+2):length(split.e)])
inum <- inum - 1
}
}
}
## add up the mass
sumMass = 0
for(i in 1:length(split.e)){
if(length(grep('[1-9]',split.e[i])) > 0){
next
} else if(split.e[i] %in% names(masses)){
nMolecules <- 1
if(i != length(split.e) && length(grep('[1-9]',split.e[i+1])) > 0)
nMolecules <- as.numeric(split.e[i+1])
sumMass <- sumMass + nMolecules * masses[[split.e[i]]]
} else {
warning(sprintf('Could not match element %s',split.e[i]))
next
}
}
summedMasses <- c(summedMasses,sumMass)
}
return(summedMasses)
}
Here are some results for your compound plus some made-up compounds (I am not a chemist):
> sumFormula2Mass2(c("C5Cl2NO2S","C5Cl2NO2S4","C5Cl10NO2S4"),masses)
[1] 207.89 303.80 583.56
Upvotes: 3