Reputation: 195
I'm trying to extract tags from an XML file using regular expressions (the re module) in Python. I need to extract the nodes that start with the tag "<PE", together with their corresponding unit IDs, which are the "<unit id" nodes above each "<PE" tag. The file can be seen here
When I use the code below, I don't get the correct "<unit id" tags, that is, the ones that correspond to each "<PE" tag. For example, in my output, the content paired with unit id="250" actually belongs to unit id="149" in the original file. In addition, the code skips some "<unit id" tags. Can anyone see where the error is in my code?
import re
t=open('ALICE.per1_replaced.txt','r')
t=t.read()
unitid=re.findall('<unit.*?"pe">', t, re.DOTALL)
PE=re.findall("<PE.*?</PE>", t, re.DOTALL)
a=zip(unitid,PE)
tp=tuple(a)
w=open('Tags.txt','w')
for x, j in tp:
    a=x + '\n'+j + '\n'
    w.write(a)
w.close()
I've tried this version as well but I had the same problems:
with open('ALICE.per1_replaced.txt','r') as t:
    contents = t.read()
unitid=re.findall('<unit.*?"pe">', contents, re.DOTALL)
PE=re.findall('<PE.*?</PE>', contents, re.DOTALL)
with open('PEtagsper1.txt','w') as fi:
    for i, p in zip(unitid, PE):
        fi.write("{}\n{}\n".format(i, p))
My desired output is a file in which each "<unit id=" tag is followed by the content of the tag that starts with "<PE" and ends with "</PE>", as below:
<unit id="16" status="FINISHED" type="pe">
<PE producer="A1.ALICE_GG"><html>
<head>
</head>
<body>
Eu vou me atrasar!' (quando ela voltou a pensar sobre isso mais trade,
ocorreu-lhe que deveria ter achado isso curioso, mas na hora tudo pareceu
bastante natural); mas quando o Coelho de fato tirou um relógio do bolso
do colete e olhou-o, e então se apressou, Alice pôs-se de pé, pois lhe
ocorreu que nunca antes vira um coelho com um colete, ou com um relógio de
bolso pra tirar, e queimando de curiosidade, ela atravessou o campo atrás
dele correndo e, felizmente, chegou justo a tempo de vê-lo entrar dentro
de uma grande toca de coelho sob a cerca.
</body>
</html></PE>
Upvotes: 0
Views: 169
Reputation: 124
You seem to have multiple <PE> tags under a single <unit> tag (e.g., for unit 3), so the two flat lists no longer line up and the zip doesn't pair them correctly. As @Error_2646 noted in the comments, an XML parser or a package such as Beautiful Soup would work better for this job.
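As a rough sketch of the parser route, here is what it could look like with the standard library's xml.etree.ElementTree. It assumes the file is well-formed XML; the <job> root element, the sample text and the pe_by_unit helper are made up for illustration and are not taken from your file:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <job> root and the contents are assumptions,
# only the <unit ...> / <PE ...> shape follows the question.
SAMPLE = """<job>
  <unit id="3" status="FINISHED" type="pe">
    <PE producer="A1.ALICE_GG"><html><body>first</body></html></PE>
    <PE producer="A2.ALICE_GG"><html><body>second</body></html></PE>
  </unit>
  <unit id="16" status="FINISHED" type="pe">
    <PE producer="A1.ALICE_GG"><html><body>third</body></html></PE>
  </unit>
</job>"""


def pe_by_unit(xml_text):
    """Return (unit id, [serialized <PE> elements]) pairs in document order."""
    root = ET.fromstring(xml_text)
    pairs = []
    for unit in root.iter("unit"):
        # Only <PE> elements nested inside this particular <unit> are collected.
        pes = [ET.tostring(pe, encoding="unicode").strip() for pe in unit.iter("PE")]
        pairs.append((unit.get("id"), pes))
    return pairs


for unit_id, pes in pe_by_unit(SAMPLE):
    print(unit_id, len(pes))
```

Because each <PE> is looked up inside its own <unit>, the pairing cannot drift, however many <PE> elements a unit has. If the real file is not well-formed, the parser will raise an error instead of returning results.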
But if for whatever reason you want to stick to regex, you can fix this by running a regex on the list of strings returned by the first regex. Example code that worked on the small part of the input I downloaded:
units=re.findall('<unit.*?</unit>', t, re.DOTALL)
unitList = []
for unit in units:
    # first get your unit regex
    unitid=re.findall('<unit.*?"pe">', unit, re.DOTALL) # same as the one you use
    # there should only be one within each
    assert (len(unitid) == 1)
    # now find all pes for this unit
    PE=re.findall("<PE.*?</PE>", unit, re.DOTALL) # same as the one you use
    # combine results
    output = unitid[0] + "\n"
    for pe in PE:
        output += pe + "\n"
    unitList.append(output)
for x in unitList:
    print(x)
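To see the misalignment itself, here is a small self-contained check on made-up input (the sample string is hypothetical, not taken from your file). It compares the flat zip with the per-unit search:

```python
import re

# Hypothetical input: unit 3 holds two <PE> elements, unit 16 holds one.
SAMPLE = (
    '<unit id="3" status="FINISHED" type="pe">\n'
    '<PE producer="A1">first</PE>\n'
    '<PE producer="A2">second</PE>\n'
    '</unit>\n'
    '<unit id="16" status="FINISHED" type="pe">\n'
    '<PE producer="A1">third</PE>\n'
    '</unit>\n'
)

unit_tags = re.findall(r'<unit.*?"pe">', SAMPLE, re.DOTALL)
pe_blocks = re.findall(r'<PE.*?</PE>', SAMPLE, re.DOTALL)

# Flat pairing: two unit tags but three <PE> blocks, so the second <PE>
# of unit 3 gets paired with unit 16 and the third <PE> is dropped by zip().
flat = list(zip(unit_tags, pe_blocks))

# Per-unit pairing: search for <PE> only inside each <unit>...</unit> span.
grouped = []
for unit in re.findall(r'<unit.*?</unit>', SAMPLE, re.DOTALL):
    tag = re.search(r'<unit.*?"pe">', unit, re.DOTALL).group(0)
    grouped.append((tag, re.findall(r'<PE.*?</PE>', unit, re.DOTALL)))

print(flat)
print(grouped)
```

In the flat version, unit 16 ends up next to a <PE> that belongs to unit 3, which is the same kind of mismatch you describe. The grouped version keeps every <PE> with its own unit.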
Upvotes: 1