Natalia Resende

Reputation: 195

Extract tags from XML using Python

I'm trying to extract tags from an XML file using regular expressions (re) in Python. I need to extract the nodes that start with the tag "<PE" and their corresponding unit IDs, which are the "<unit id=" nodes above each "<PE" tag. The file can be seen here

When I use the code below, I don't get the correct "<unit id" tags, that is, the ones that correspond to each "<PE" tag. For example, in my output, the content extracted from the "<PE" tag paired with "<unit id=250" actually belongs to "<unit id=149" in the original file. The code also skips some "<unit id" tags. Can anyone see where the error is in my code?

import re

t=open('ALICE.per1_replaced.txt','r')
t=t.read()

unitid=re.findall('<unit.*?"pe">', t, re.DOTALL)
PE=re.findall("<PE.*?</PE>", t, re.DOTALL)

a=zip(unitid,PE)
tp=tuple(a)

w=open('Tags.txt','w')

for x, j in tp:
    a=x + '\n'+j + '\n'
    w.write(a)

w.close()

I've tried this version as well, but I had the same problems:

with open('ALICE.per1_replaced.txt','r') as t:
    contents = t.read()

unitid=re.findall('<unit.*?"pe">', contents, re.DOTALL)
PE=re.findall('<PE.*?</PE>', contents, re.DOTALL)
with open('PEtagsper1.txt','w') as fi:
    for i, p in zip(unitid, PE):
        fi.write("{}\n{}\n".format(i, p))

My desired output is a file with each "<unit id=" tag followed by the content of the tag that starts with "<PE" and ends with "</PE>", as below:

<unit id="16" status="FINISHED" type="pe">
<PE producer="A1.ALICE_GG"><html>
  <head>

  </head>
  <body>
    Eu vou me atrasar!' (quando ela voltou a pensar sobre isso mais trade, 
    ocorreu-lhe que deveria ter achado isso curioso, mas na hora tudo pareceu 
    bastante natural); mas quando o Coelho de fato tirou um relógio do bolso 
    do colete e olhou-o, e então se apressou, Alice pôs-se de pé, pois lhe 
    ocorreu que nunca antes vira um coelho com um colete, ou com um relógio de 
    bolso pra tirar, e queimando de curiosidade, ela atravessou o campo atrás 
    dele correndo e, felizmente, chegou justo a tempo de vê-lo entrar dentro 
    de uma grande toca de coelho sob a cerca.
  </body>
</html></PE>

Upvotes: 0

Views: 169

Answers (1)

Karthik Sriram

Reputation: 124

You seem to have multiple <PE> tags under each <unit> tag (e.g., for unit 3), so the zip doesn't pair them correctly. As @Error_2646 noted in the comments, an XML parser or the Beautiful Soup package would work better for this job.
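For the parser route, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes the file is well-formed XML with <PE> elements nested inside <unit> elements. The SAMPLE string, the <job> root and the pe_by_unit helper are made up for illustration and are not taken from your actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the input; the real file's layout is assumed, not verified.
SAMPLE = """<job>
<unit id="3" status="FINISHED" type="pe">
<PE producer="A1"><html><body>first edit</body></html></PE>
<PE producer="A1"><html><body>second edit</body></html></PE>
</unit>
<unit id="4" status="FINISHED" type="ht">
<HT producer="A1">translated from scratch</HT>
</unit>
<unit id="16" status="FINISHED" type="pe">
<PE producer="A1"><html><body>third edit</body></html></PE>
</unit>
</job>"""


def pe_by_unit(xml_text):
    """Return (unit id, [text of each PE]) for every unit that contains a PE."""
    root = ET.fromstring(xml_text)
    result = []
    for unit in root.iter("unit"):
        # Search only inside this unit, so every PE stays attached to its own id.
        pes = unit.findall(".//PE")
        if pes:
            texts = ["".join(pe.itertext()).strip() for pe in pes]
            result.append((unit.get("id"), texts))
    return result


pairs = pe_by_unit(SAMPLE)
for unit_id, texts in pairs:
    print(unit_id, texts)
```

Because the search happens inside each unit element, units without a PE are simply skipped and a unit with several PEs keeps all of them. If you need the raw markup instead of the text, ET.tostring(pe, encoding="unicode") should give you the serialized element.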

But if for whatever reason you want to stick to regex, you can fix this by running a regex on the list of strings returned by the first regex. Example code that worked on the small part of the input I downloaded:

import re

# t holds the file contents, as in the question's first snippet (t = t.read())
units = re.findall('<unit.*?</unit>', t, re.DOTALL)
unitList = []
for unit in units:
    # first get the opening tag of this unit
    unitid = re.findall('<unit.*?"pe">', unit, re.DOTALL)  # same as the one you use
    # there should only be one within each
    assert len(unitid) == 1
    # now find all PEs for this unit
    PE = re.findall("<PE.*?</PE>", unit, re.DOTALL)  # same as the one you use
    # combine results
    output = unitid[0] + "\n"
    for pe in PE:
        output += pe + "\n"
    unitList.append(output)

for x in unitList:
    print(x)
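To see why searching unit by unit helps, here is a small self-contained comparison on made-up data (SAMPLE is hypothetical: unit 3 has two PE tags, unit 16 has one). It pairs the tags once with a flat zip, as in the question, and once per unit, as in the loop above.

```python
import re

# Hypothetical miniature input: unit 3 has two <PE> tags, unit 16 has one.
SAMPLE = """<unit id="3" status="FINISHED" type="pe">
<PE producer="A1">first</PE>
<PE producer="A1">second</PE>
</unit>
<unit id="16" status="FINISHED" type="pe">
<PE producer="A1">third</PE>
</unit>"""

unit_tags = re.findall(r'<unit.*?"pe">', SAMPLE, re.DOTALL)
pe_tags = re.findall(r"<PE.*?</PE>", SAMPLE, re.DOTALL)

# Flat pairing: zip matches purely by position and stops at the shorter list.
flat = [
    (re.search(r'id="(\d+)"', u).group(1), p)
    for u, p in zip(unit_tags, pe_tags)
]

# Per-unit pairing: each PE is looked up inside its own <unit>...</unit> chunk.
grouped = []
for unit in re.findall(r"<unit.*?</unit>", SAMPLE, re.DOTALL):
    unit_id = re.search(r'id="(\d+)"', unit).group(1)
    for pe in re.findall(r"<PE.*?</PE>", unit, re.DOTALL):
        grouped.append((unit_id, pe))

print(flat)
print(grouped)
```

In the flat version, the second PE of unit 3 ends up next to the id of unit 16 and the last PE is dropped, which matches the kind of shifted and missing entries described in the question. The grouped version keeps every PE with the unit it came from.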

Upvotes: 1
