Reputation: 666
I'm trying to do the following: given a single-column pandas.DataFrame
(of chemical formulas) like
formula
0 Hg0.7Cd0.3Te
1 CuBr
2 Lu
...
I would like to return a pandas.Series
like
0 [(Hg, 0.7), (Cd, 0.3), (Te, 1)]
1 [(Cu, 1), (Br, 1)]
2 [(Lu, 1)]
...
I've already tried the following regular expression:
counts = pd.Series(formulae.values.flatten()).str.findall(r"([a-z]+)([0-9]+)", re.I)
but unfortunately my output is the following:
0 [(Hg, 0), (Cd, 0)]
1 []
2 []
3 [(Cu, 3), (SbSe, 4)]
so in some cases it does not recognize the different elements in the chemical formula.
Upvotes: 0
Views: 287
Reputation: 627126
You can use
import pandas as pd
df = pd.DataFrame({'formula':['Hg0.7Cd0.3Te', 'CuBr', 'Lu']})
df['counts'] = df['formula'].str.findall(r'([A-Z][a-z]*)(\d+(?:\.\d+)?)?')
df['counts'] = df['counts'].apply(lambda x: [(a,b) if b else (a,1) for a,b in x])
Output:
>>> df['counts']
0 [(Hg, 0.7), (Cd, 0.3), (Te, 1)]
1 [(Cu, 1), (Br, 1)]
2 [(Lu, 1)]
Details:
([A-Z][a-z]*) - Group 1: an uppercase letter followed by zero or more lowercase letters
(\d+(?:\.\d+)?)? - an optional group 2: one or more digits followed by an optional occurrence of a dot and one or more digits.
The df['counts'].apply(lambda x: [(a,b) if b else (a,1) for a,b in x]) step puts 1 in as the second item of each tuple where it is empty.
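To see why the pattern from the question misses elements, here is a minimal sketch that runs both patterns with the standard `re` module. It assumes that `Series.str.findall` gives the same result per string as `re.findall`. The names `old` and `new` are only illustrative.

```python
import re

# Pattern from the question: case-insensitive letters, then a mandatory integer.
old = re.compile(r"([a-z]+)([0-9]+)", re.I)
# Pattern from this answer: capitalised symbol, then an optional (decimal) number.
new = re.compile(r"([A-Z][a-z]*)(\d+(?:\.\d+)?)?")

for formula in ["Hg0.7Cd0.3Te", "CuBr", "Lu"]:
    print(formula)
    print("  old:", old.findall(formula))
    print("  new:", new.findall(formula))
```

The old pattern needs a digit after every run of letters, so 'CuBr' and 'Lu' yield no matches, and it stops at the dot in '0.7'.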
Upvotes: 1
Reputation: 26676
I would use multiple replacements to introduce separators, split on those separators, explode, and then filter. Code below:
repl2 = lambda g: f'{g.group(1)}<'  # append '<' after the matched text
repl3 = lambda g: f'{g.group(1)}>'  # append '>' after the matched text
df1 = (df1.assign(formula1=df1['formula'].str.replace(r'((?<=[A-Z])\w)', repl3, regex=True)  # introduce a separator where an alphanumeric character follows a capital letter
                                         .str.replace(r'(\d(?=[A-Z]))', repl2, regex=True))  # introduce a separator where a digit is followed by a capital letter
       .replace(regex={r'\>(?=0)': ',', r'\>': ',1 '})  # replace the introduced '>' separators
      )
df1 = df1.assign(formula1=df1['formula1'].str.split(r'\<|\s')).explode('formula1')  # split on '<' or whitespace, then explode
new = df1[df1['formula1'].str.contains(r'\w')]  # keep only the rows that contain details
formula formula1
0 Hg0.7Cd0.3Te Hg,0.7
0 Hg0.7Cd0.3Te Cd,0.3
0 Hg0.7Cd0.3Te Te,1
1 CuBr Cu,1
1 CuBr Br,1
2 Lu Lu,1
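The formula1 column above holds strings such as 'Hg,0.7' rather than the tuples the question asks for. A minimal sketch of the last conversion step, in plain Python, could look like this. `to_pair` is a hypothetical helper name, and the sample values are copied from the output above.

```python
def to_pair(item):
    """Turn a string like 'Hg,0.7' into a tuple like ('Hg', 0.7)."""
    element, count = item.split(",")
    return element, float(count)


# Values copied from the formula1 column for the first formula.
rows = ["Hg,0.7", "Cd,0.3", "Te,1"]
pairs = [to_pair(r) for r in rows]
print(pairs)  # [('Hg', 0.7), ('Cd', 0.3), ('Te', 1.0)]
```

On the exploded frame, the same function could probably be applied with new['formula1'].map(to_pair) and the results collected per original index.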
Upvotes: 0
Reputation: 36299
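There are a few things to be improved:
- [0-9]+ only matches integers; to also match decimals such as 0.7, use ([0-9]+(?:[.][0-9]+)?) instead.
- The number should be optional, so follow that group with ?.
- An element symbol is an uppercase letter followed by zero or more lowercase letters: [A-Z][a-z]*. That's important to distinguish different elements with no number in between, e.g. 'CuBr' (so ignore-case wouldn't work here).
Putting it all together: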
from pprint import pprint
import re
formulae = ['Hg0.7Cd0.3Te', 'CuBr', 'Lu']
pattern = re.compile('([A-Z][a-z]*)([0-9]+(?:[.][0-9]+)?)?')
pprint([pattern.findall(f) for f in formulae])
This prints the following:
[[('Hg', '0.7'), ('Cd', '0.3'), ('Te', '')],
[('Cu', ''), ('Br', '')],
[('Lu', '')]]
As you can see, missing numbers are denoted by empty strings which you need to postprocess manually. For example:
result = [pattern.findall(f) for f in formulae]
result = [[(e, float(n or 1)) for e, n in f] for f in result]
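The two steps above can be wrapped into one function. This is only a sketch: `parse_formula` and `PATTERN` are illustrative names, and the function assumes flat formulas without grouping parentheses, so something like 'Ca(OH)2' is not handled correctly.

```python
import re

# Capitalised element symbol, then an optional integer or decimal count.
PATTERN = re.compile(r"([A-Z][a-z]*)([0-9]+(?:[.][0-9]+)?)?")


def parse_formula(formula):
    """Return (element, count) tuples; a missing count defaults to 1.0."""
    return [(element, float(count or 1))
            for element, count in PATTERN.findall(formula)]


print(parse_formula("Hg0.7Cd0.3Te"))  # [('Hg', 0.7), ('Cd', 0.3), ('Te', 1.0)]
print(parse_formula("CuBr"))          # [('Cu', 1.0), ('Br', 1.0)]
```

With the DataFrame from the question, df['formula'].apply(parse_formula) should then give the desired Series.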
Upvotes: 1