Sabby

Reputation: 24

Python XML parsing with BeautifulSoup

I have the XML file below and would like to extract all the href values. I know how to do that, but I also want to mark the end of each main 'parent' tag with -----

I need an output like this:

xxxx yyyy ----- zzzz tttt ------ wwww qqqqq ssss uuuu oooo pppp ----- mmmm nnnnn ----

xml:

    <root>
      <parent id1='1111'>
        <child herf='xxx'/>
        <child herf ='yyyy'/>
      </parent>
      <parent id1='22222'>
        <child herf='zzzz'/>
        <child herf ='tttt'/>
      </parent>
      <parent id1='33333'>
        <child herf='wwww'/>
        <child herf ='qqqqq'/>
      <parent id1='4444'>
        <child herf='ssss'/>
        <child herf ='uuuu'/>
      </parent>
      <parent id1='55555'>
        <child herf='oooo'/>
        <child herf ='pppp'/>
      </parent>
      <parent id1='6666'>
        <child herf='mmmm'/>
        <child herf ='nnnnn'/>
      </parent>

This is my code:

    xml = soupTop.findChildren(recursive=False)
    for tag in xml:
        s = tag.findAll("child", {"href": re.compile(r".*")})
        print(s)

Upvotes: 0

Views: 220

Answers (1)

Serge Ballesta

Reputation: 149175

One problem is that your XML is not valid: the <root> tag is never closed, nor is <parent id1='33333'>. BeautifulSoup is good at accepting malformed input, but the resulting tree has to be processed with care.
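A quick way to confirm that the document is malformed is to hand it to a strict parser. The following is only a minimal sketch, assuming the standard-library xml.etree.ElementTree module instead of BeautifulSoup; the helper name is_well_formed and the small comparison fragment are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The document from the question, copied as posted (note the unclosed tags).
BROKEN_XML = """<root>
<parent id1='1111'> <child herf='xxx'/> <child herf ='yyyy'/> </parent>
<parent id1='22222'> <child herf='zzzz'/> <child herf ='tttt'/> </parent>
<parent id1='33333'> <child herf='wwww'/> <child herf ='qqqqq'/>
<parent id1='4444'> <child herf='ssss'/> <child herf ='uuuu'/> </parent>
<parent id1='55555'> <child herf='oooo'/> <child herf ='pppp'/> </parent>
<parent id1='6666'> <child herf='mmmm'/> <child herf ='nnnnn'/> </parent>
"""

# A small well-formed fragment for comparison (illustrative only).
WELL_FORMED_XML = "<root><parent id1='1111'><child herf='xxx'/></parent></root>"


def is_well_formed(text):
    """Return True if a strict XML parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(BROKEN_XML))       # False: <root> and one <parent> are never closed
print(is_well_formed(WELL_FORMED_XML))  # True
```

A strict parser simply refuses the input, whereas a lenient one such as BeautifulSoup has to guess where the missing closing tags belong, which is why the grouping of the later parent tags is ambiguous.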

Because the input is malformed, I cannot see a way to obtain exactly the output you ask for in the question. What I can do is:

  1. Assume that each new opening parent tag starts a new sequence of children. That means finding all parent tags and, in each one, processing only its direct children:

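    # Visit every parent tag, nested or not, in document order
    for p in soupTop.findAll('parent'):
        # Look only at direct children, skipping text nodes and nested parents
        for c in p.children:
            if c.name == 'child':
                print(c['herf'], end=' ')
        print('-----', end=' ')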
    

    output is:

    xxx yyyy ----- zzzz tttt ----- wwww qqqqq ----- ssss uuuu ----- oooo pppp ----- mmmm nnnnn ----- 
    
  2. Process only the top-level parent tags and, in each one, recursively find all child tags:

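    # Start at the first parent tag and walk its siblings only
    p = soupTop.find('parent')
    while p is not None:
        # findAll is recursive, so children of nested parents are included
        for c in p.findAll('child'):
            print(c['herf'], end=' ')
        print('-----', end=' ')
        p = p.findNextSibling('parent')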
    

    output is:

    xxx yyyy ----- zzzz tttt ----- wwww qqqqq ssss uuuu oooo pppp mmmm nnnnn ----- 
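For anyone who wants to try both strategies end to end, here is a self-contained sketch. It assumes the beautifulsoup4 package is installed and uses Python's built-in "html.parser" backend; other backends (lxml, for instance) may repair the unclosed tags differently and give different groupings. The function names are hypothetical, and find_all / find_next_sibling are the current spellings of findAll / findNextSibling.

```python
from bs4 import BeautifulSoup

# The document from the question, copied as posted (unclosed tags included).
XML = """<root>
<parent id1='1111'> <child herf='xxx'/> <child herf ='yyyy'/> </parent>
<parent id1='22222'> <child herf='zzzz'/> <child herf ='tttt'/> </parent>
<parent id1='33333'> <child herf='wwww'/> <child herf ='qqqqq'/>
<parent id1='4444'> <child herf='ssss'/> <child herf ='uuuu'/> </parent>
<parent id1='55555'> <child herf='oooo'/> <child herf ='pppp'/> </parent>
<parent id1='6666'> <child herf='mmmm'/> <child herf ='nnnnn'/> </parent>
"""


def direct_children_per_parent(soup):
    """Strategy 1: every parent tag, direct child tags only."""
    parts = []
    for p in soup.find_all("parent"):
        for c in p.find_all("child", recursive=False):
            parts.append(c["herf"])
        parts.append("-----")
    return " ".join(parts)


def all_children_per_top_level_parent(soup):
    """Strategy 2: sibling parent tags only, all nested child tags."""
    parts = []
    p = soup.find("parent")
    while p is not None:
        for c in p.find_all("child"):  # recursive by default
            parts.append(c["herf"])
        parts.append("-----")
        p = p.find_next_sibling("parent")
    return " ".join(parts)


# Assumption: the lenient built-in HTML parser is used to read the broken XML.
soup = BeautifulSoup(XML, "html.parser")
print(direct_children_per_parent(soup))
print(all_children_per_top_level_parent(soup))
```

With this parser the unclosed <parent id1='33333'> swallows the three parent tags that follow it, which is why the second strategy reports their children as one long group.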
    

Upvotes: 1
