Reputation: 57
I have a list of thousands of chemical formulas that could include symbols for any element. I would like to determine the total number of atoms of any element in each formula. Examples include:
I want the total number of atoms in a single formula, so for the first example (CH3NO3), the answer would be 8 (1 carbon + 3 hydrogens + 1 nitrogen + 3 oxygens).
I found code by PEH (Extract numbers from chemical formula) that uses regular expression to extract the number of instances of a specific element in a chemical formula.
Could this be adapted to give the total atoms?
Public Function ChemRegex(ChemFormula As String, Element As String) As Long
Dim regEx As New RegExp
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
End With
'first pattern matches every element once
regEx.Pattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)"
Dim Matches As MatchCollection
Set Matches = regEx.Execute(ChemFormula)
Dim m As Match
For Each m In Matches
If m.SubMatches(0) = Element Then
ChemRegex = ChemRegex + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
End If
Next m
'second patternd finds parenthesis and multiplies elements within
regEx.Pattern = "(\((.+?)\)([0-9])+)+?"
Set Matches = regEx.Execute(ChemFormula)
For Each m In Matches
ChemRegex = ChemRegex + ChemRegex(m.SubMatches(1), Element) * (m.SubMatches(2) - 1) '-1 because all elements were already counted once in the first pattern
Next m
End Function
Upvotes: 2
Views: 526
Reputation: 75930
Here's my two cent's
Formula in C1
:
=ChemRegex(A1)
Where ChemRegex()
calls:
Public Function ChemRegex(ChemFormula As String) As Long
With CreateObject("vbscript.regexp")
.Global = True
.Pattern = "[A-Z][a-z]*(\d*)"
If .Test(ChemFormula) Then
Set matches = .Execute(ChemFormula)
For Each Match In matches
ChemRegex = ChemRegex + IIf(Match.Submatches(0) = "", 1, Match.Submatches(0))
Next
Else
ChemRegex = 0
End If
End With
End Function
Or in a (shorter) 2-step regex-solution:
Public Function ChemRegex(ChemFormula As String) As Long
With CreateObject("vbscript.regexp")
.Global = True
.Pattern = "([A-Za-z])(?=[A-Z]|$)"
ChemFormula = .Replace(ChemFormula, "$1-1")
.Pattern = "\D+"
ChemFormula = .Replace(ChemFormula, "+")
ChemRegex = Evaluate(ChemFormula)
End With
End Function
Upvotes: 1
Reputation: 57743
You could do that by looping through all characters. Count all capital characters and add all numbers subtracted by 1. That is the total count of elements.
Option Explicit
Public Function ChemCountTotalElements(ByVal ChemFormula As String) As Long
Dim RetVal As Long
Dim c As Long
For c = 1 To Len(ChemFormula)
Dim Char As String
Char = Mid$(ChemFormula, c, 1)
If IsNumeric(Char) Then
RetVal = RetVal + CLng(Char) - 1
ElseIf Char = UCase(Char) Then
RetVal = RetVal + 1
End If
Next c
ChemCountTotalElements = RetVal
End Function
Note that this does not handle parenthesis! And it does not check if the element actually exists. So XYZ2
will be counted as 4
.
Also this only can handle numbers below 10
. In case you have numbers with 10
and above use the RegEx solution below (which can handle that).
If you need a more precise way (checking the existance of the Elements) and recognizing parenthesis you need to do it with RegEx again.
Because VBA doesn't support regular expressions out of the box we need to reference a Windows library first.
Add this function to a module
Public Function ChemRegexCountTotalElements(ByVal ChemFormula As String) As Long
Dim RetVal As Long
Dim regEx As New RegExp
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
End With
'first pattern matches every element once
regEx.Pattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)"
Dim Matches As MatchCollection
Set Matches = regEx.Execute(ChemFormula)
Dim m As Match
For Each m In Matches
RetVal = RetVal + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
Next m
'second patternd finds parenthesis and multiplies elements within
regEx.Pattern = "(\((.+?)\)([0-9]+)+)+?"
Set Matches = regEx.Execute(ChemFormula)
For Each m In Matches
RetVal = RetVal + ChemRegexCountTotalElements(m.SubMatches(1)) * (m.SubMatches(2) - 1) '-1 because all elements were already counted once in the first pattern
Next m
ChemRegexCountTotalElements = RetVal
End Function
While this code will also recognize parenthesis, note that it does not recognize nested parenthesis.
Upvotes: 6