Extract tags from XML using Python

Question

I'm trying to extract tags from an XML file using regular expressions (the re module) in Python. I need to extract the nodes that start with the tag "<PE" and their corresponding unit IDs, which are the nodes above each such tag.

When I use the code below, I don't get the correct tags:

import re

t=open('ALICE.per1_replaced.txt','r')

t=t.read()




unitid=re.findall('', t, re.DOTALL)
PE=re.findall("", t, re.DOTALL)


a=zip(unitid,PE)

tp=tuple(a)


w=open('Tags.txt','w')

for x, j in tp:
    a = x + '\n' + j + '\n'
    w.write(a)



w.close()

I've tried this version as well but I had the same problems:

with open('ALICE.per1_replaced.txt','r') as t:
  contents = t.read()

unitid=re.findall('', contents,  re.DOTALL)
PE=re.findall('', contents, re.DOTALL)
with open('PEtagsper1.txt','w') as fi:
    for i, p in zip(unitid, PE):
        fi.write("{}\n{}\n".format(i, p))

my desired output is a file with tags "



  

  
  
    Eu vou me atrasar!' (quando ela voltou a pensar sobre isso mais trade, 
    ocorreu-lhe que deveria ter achado isso curioso, mas na hora tudo pareceu 
    bastante natural); mas quando o Coelho de fato tirou um relógio do bolso 
    do colete e olhou-o, e então se apressou, Alice pôs-se de pé, pois lhe 
    ocorreu que nunca antes vira um coelho com um colete, ou com um relógio de 
    bolso pra tirar, e queimando de curiosidade, ela atravessou o campo atrás 
    dele correndo e, felizmente, chegou justo a tempo de vê-lo entrar dentro 
    de uma grande toca de coelho sob a cerca.

Karthik Sriram · Accepted Answer

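You seem to have multiple PE tags under each unit tag (e.g., for unit 3), so the zip pairs the two lists up incorrectly: both lists are matched across the whole file, and they stop lining up as soon as one unit contains more than one PE tag. As @Error_2646 noted in the comments, an XML parser or the Beautiful Soup package would work better for this job.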
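For the parser route, here is a minimal sketch using the standard library's xml.etree.ElementTree. The original markup is not visible in the post, so the element names (unit, PE), the id attribute, and the sample text are all assumptions made for illustration, and the sketch also assumes the file is well-formed XML. The helper name pe_tags_by_unit is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are assumptions,
# not taken from the original file.
SAMPLE = """<doc>
  <unit id="1"><PE type="a">first</PE></unit>
  <unit id="2"><PE type="b">second</PE><PE type="c">third</PE></unit>
  <unit id="3"><PE type="d">fourth</PE></unit>
</doc>"""


def pe_tags_by_unit(xml_text):
    """Return a list of (unit id, [serialized PE elements]) pairs."""
    root = ET.fromstring(xml_text)
    grouped = []
    for unit in root.iter("unit"):
        # Search only inside this unit, so each PE stays with its own unit.
        pes = [
            ET.tostring(pe, encoding="unicode").strip()
            for pe in unit.iter("PE")
        ]
        grouped.append((unit.get("id"), pes))
    return grouped


for unit_id, pes in pe_tags_by_unit(SAMPLE):
    print(unit_id)
    for pe in pes:
        print("  " + pe)
```

Because the PE elements are looked up inside each unit rather than across the whole document, a unit with several PE tags keeps all of them, and no zip is needed.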

But if for whatever reason you want to stick to regex, you can fix this by running a regex on the list of strings returned by the first regex. Example code that worked on the small part of the input I downloaded:

units=re.findall('', t, re.DOTALL)
unitList = []
for unit in units:
    #first get your unit regex
    unitid=re.findall('', unit, re.DOTALL) # same as the one you use
    #there should only be one within each
    assert (len(unitid) == 1)
    #now find all pes for this unit
    PE=re.findall("", unit, re.DOTALL) # same as the one you use
    # combine results
    output = unitid[0] + "\n"
    for pe in PE:
        output += pe + "\n"
    unitList.append(output)

for x in unitList:
    print(x)
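Since the patterns above did not survive in the post, here is a self-contained sketch of the same two-stage idea that can be run as-is. The sample markup, the tag and attribute names, and the three patterns (UNIT_BLOCK, UNIT_OPEN, PE_OPEN) are assumptions for illustration, not the asker's actual patterns. It also shows, for contrast, how matching across the whole text and zipping loses a tag.

```python
import re

# Hypothetical markup; the real tag and attribute names are not shown in the post.
SAMPLE = '''<unit id="1">
<PE producer="x">one</PE>
</unit>
<unit id="2">
<PE producer="y">two</PE>
<PE producer="z">three</PE>
</unit>'''

# Assumed patterns, for illustration only.
UNIT_BLOCK = re.compile(r'<unit\b.*?</unit>', re.DOTALL)  # one whole unit
UNIT_OPEN = re.compile(r'<unit\b[^>]*>')                  # its opening tag
PE_OPEN = re.compile(r'<PE\b[^>]*>')                      # a PE opening tag


def tags_per_unit(text):
    """Return (unit opening tag, [PE opening tags]) for each unit block."""
    results = []
    for block in UNIT_BLOCK.findall(text):
        # Each block starts with its own opening tag, so match() is enough.
        unit_tag = UNIT_OPEN.match(block).group(0)
        # Second-stage regex runs on this block only, not on the whole text.
        results.append((unit_tag, PE_OPEN.findall(block)))
    return results


for unit_tag, pe_tags in tags_per_unit(SAMPLE):
    print(unit_tag)
    for pe_tag in pe_tags:
        print(pe_tag)

# For contrast: matching across the whole text and zipping pairs tags by
# position, and zip stops at the shorter list, so the third PE tag is lost.
misaligned = list(zip(UNIT_OPEN.findall(SAMPLE), PE_OPEN.findall(SAMPLE)))
```

Regex matching of this kind breaks on nested or irregular markup, which is why a parser is usually the safer choice when the file is well-formed.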
