Reputation: 175
I have the following string, which I am parsing from another file: "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)". I want to split it into individual values, which I do with the .split() function, so I get them as a list:
x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
Now I want to split each element further into 3 segments and store them in 3 variables so I can write them to Excel, like this:
a = CHEM1
b = 5
c = GL
for the first element, then loop back for the second element:
a = CH3M2
b = 55
c = LB
and finally :
a = CHEM3954114
b = 50
c = KG
I am unsure how to go about that as I am still new to Python. To the best of my knowledge, I could call the split function several times, but I believe there has to be a better way to do it than that.
Thank you.
Upvotes: 2
Views: 687
Reputation: 3379
You should use the re package:
import re
x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
# Raw string, so the backslashes reach the regex engine unchanged
pattern = re.compile(r"([^\(]+)\((\d+)(.+)\)")
for x1 in x:
    m = pattern.search(x1)
    if m:
        a, b, c = m.group(1), int(m.group(2)), m.group(3)
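As a possible variation on the loop above, here is a minimal sketch that uses named groups so the three parts are self-describing, and re.fullmatch so that elements with unexpected trailing text are rejected. The helper name parse_item and the group names name, qty and unit are illustrative and not part of the original answer; it assumes every element looks like NAME(NUMBERUNIT).

```python
import re

# Hypothetical pattern and helper; the group names are illustrative.
ITEM = re.compile(r"(?P<name>[^(]+)\((?P<qty>\d+)(?P<unit>[^)]+)\)")

def parse_item(item):
    """Return (name, quantity, unit), or None if the item does not fit the pattern."""
    m = ITEM.fullmatch(item)
    if m is None:
        return None
    return m.group("name"), int(m.group("qty")), m.group("unit")

x = ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
rows = [parse_item(item) for item in x]
print(rows)
```

Each tuple in rows holds the a, b and c values for one element, which may be convenient if the values are later written out row by row.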
FOLLOW UP:
The regex topic is enormous and extremely well covered on this site, as Tim has highlighted above. I can share my thinking for this specific case. Essentially, there are 3 groups of characters you want to extract:
- everything up to the opening bracket ( (bracket not included)
- the digits right after the opening bracket (
- the remaining characters up to the closing bracket ) (bracket not included)
A group is anything enclosed in brackets (). In this specific case it may become confusing because, as stressed above, brackets are also part of the string being matched; those need to be escaped with a \ to be distinguished from the ones used by the regular expression syntax.
The first group is ([^\(]+), which essentially means: match one or more characters which are not ( (the ^ is the negation, and the bracket ( is escaped here for the reasons described above, although inside a character class the escape is optional). Note that these characters may include not only letters and numbers but also special characters like $, £, - and so forth. I wanted to keep my options open here, but you can be more precise if you need to (matching, for example, only letters, numbers and underscores using [\w]+).
The second group is (\d+), which matches 1 or more (expressed with +) digits (expressed with \d).
The third group is (.+), which matches any remaining characters, with the final \) making sure that the match stops at the closing bracket.
Upvotes: 4
Reputation: 68004
Use re and create a list of dictionaries
import re
x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
keys = ['a', 'b', 'c']
y = []
for s in x:
    # Rewrite 'NAME(5GL)' as 'NAME 5 GL', then split on whitespace
    vals = re.sub(r'(.*?)\((\d*)(.*?)\)', r'\1 \2 \3', s).split()
    y.append(dict(zip(keys, vals)))
for i in y:
    print("a: %s\nb: %s\nc: %s\n" % (i['a'], i['b'], i['c']))
gives
a: CHEM1
b: 5
c: GL
a: CH3M2
b: 55
c: LB
a: CHEM3954114
b: 50
c: KG
Upvotes: 0
Reputation: 181
Considering the elements you provided in your question, I assume that '(' cannot appear more than once in an element.
Here is the function I wrote.
def decontruct(chem):
    # Everything before the first '(' is the name
    name = chem[:chem.index('(')]
    # Text between '(' and the trailing ')'
    qty = chem[chem.index('(') + 1:-1]
    mag, unit = "", ""
    for char in qty:
        if char.isalpha():
            unit += char
        else:
            mag += char
    # float() also handles decimals such as 5.4; use int(mag) instead if magnitudes are always whole numbers
    return {"name": name, "mag": float(mag), "unit": unit}
Usage:
x = ['CHEM1(5.4GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
for chem in x:
    d = decontruct(chem)
    print(d["name"], d["mag"], d["unit"])
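For comparison, the same split could probably also be done with a regular expression that accepts decimal magnitudes such as the 5.4 in the usage above. This is only a sketch: the name deconstruct_re is hypothetical, and it assumes magnitudes are plain non-negative decimals and units consist of ASCII letters only.

```python
import re

# Hypothetical regex variant: name, then '(' magnitude unit ')'.
PATTERN = re.compile(r"([^(]+)\((\d+(?:\.\d+)?)([A-Za-z]+)\)")

def deconstruct_re(chem):
    """Split 'NAME(5.4GL)' into name, magnitude and unit."""
    m = PATTERN.fullmatch(chem)
    if m is None:
        # Fail loudly instead of returning partial data
        raise ValueError(f"unexpected format: {chem!r}")
    name, mag, unit = m.groups()
    return {"name": name, "mag": float(mag), "unit": unit}

for chem in ['CHEM1(5.4GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']:
    d = deconstruct_re(chem)
    print(d["name"], d["mag"], d["unit"])
```

Unlike the character loop, this version raises an error for input that does not fit the expected shape, which may or may not be what you want.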
Upvotes: 1
Reputation: 521053
Using re.findall we can try:
import re
x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
for inp in x:
    matches = re.findall(r'(\w+)\((\d+)(\w+)\)', inp)
    print(matches)
# [('CHEM1', '5', 'GL')]
# [('CH3M2', '55', 'LB')]
# [('CHEM3954114', '50', 'KG')]
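Since re.findall scans the whole string for non-overlapping matches, the initial .split() is arguably not needed. A minimal sketch, assuming the original unsplit string from the question and well-formed items; the variable names raw and triples are illustrative:

```python
import re

# The unsplit string from the question
raw = "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)"

# One tuple of (name, digits, unit) per item found in the string
triples = re.findall(r'(\w+)\((\d+)(\w+)\)', raw)

for a, b, c in triples:
    # findall returns strings, so convert the quantity explicitly
    print(a, int(b), c)
```

Items that do not fit the pattern are silently skipped by findall, so this suits input that is known to be clean.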
Upvotes: 4