python

Parsing an HTML table with BeautifulSoup into a Python dictionary

This is the HTML that I'm trying to parse with BeautifulSoup:

<table>
          <tr>
            <th width="100">menu1</th>
            <td>
              <ul class="classno1" style="margin-bottom:10;">
                    <li>Some data1</li>
                    <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
                    ... (the number of these tags isn't fixed)
              </ul>
            </td>
          </tr>
          <tr>
            <th width="100">menu2</th>
            <td>
              <ul class="classno1" style="margin-bottom:10;">
                    <li>Some data2</li>
                    <li>Foo2<a href="/link/to/bar2">Bar2</a></li>
                    <li>Foo3<a href="/link/to/bar3">Bar3</a></li>
                    <li>Some data3</li>
                    ... (the number of these tags isn't fixed either)
              </ul>
            </td>
          </tr>
</table>

The output I would like to get is a dictionary like this:

DICT = {
    'menu1': ['Some data1','Foo1 Bar1'],
    'menu2': ['Some data2','Foo2 Bar2','Foo3 Bar3','Some data3'],
}

As I already mentioned in the code, the number of <li> tags is not fixed. Additionally, there could be:

  • menu1 and menu2
  • just menu1
  • just menu2
  • no menu1 and menu2 (just <table></table>)

    So, for example, it could look like this:

    <table>
              <tr>
                <th width="100">menu1</th>
                <td>
                  <ul class="classno1" style="margin-bottom:10;">
                        <li>Some data1</li>
                        <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
                        ... (the number of these tags isn't fixed)
                  </ul>
                </td>
              </tr>
    </table>
    

    I tried to adapt this example, but with no success. I think the <ul> tags are the reason I can't read the data from the table properly. The variable number of menus and <li> tags is also a problem for me. So my question is: how do I parse this particular table into a Python dictionary? I should mention that I have already parsed some simple data with the .text attribute of the BeautifulSoup handler, so it would be nice if I could keep that as is.

    request = c.get('http://example.com/somepage.html')
    soup = bs(request.text)
    

    The table is always the first one on the page, so I can get it with:

    table = soup.find_all('table')[0]
    

    Thank you in advance for any help.

  • Answers (1)

    furas

    html = """<table>
              <tr>
                <th width="100">menu1</th>
                <td>
                  <ul class="classno1" style="margin-bottom:10;">
                        <li>Some data1</li>
                        <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
                  </ul>
                </td>
              </tr>
              <tr>
                <th width="100">menu2</th>
                <td>
                  <ul class="classno1" style="margin-bottom:10;">
                        <li>Some data2</li>
                        <li>Foo2<a href="/link/to/bar2">Bar2</a></li>
                        <li>Foo3<a href="/link/to/bar3">Bar3</a></li>
                        <li>Some data3</li>
                  </ul>
                </td>
              </tr>
    </table>"""
    
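    import BeautifulSoup as bs  # BeautifulSoup 3 module (Python 2)
    
    soup = bs.BeautifulSoup(html)
    
    # the table is always the first one on the page
    table = soup.findAll('table')[0]
    
    results = {}
    
    # every <th> holds a menu name
    th = table.findChildren('th')  # ,text=['menu1','menu2'])
    
    for x in th:
        results_li = []
        # skip the whitespace text node after </th> to reach the <td>
        li = x.nextSibling.nextSibling.findChildren('li')
        for y in li:
            # y.next is the first text node inside <li> (link text is not included)
            results_li.append(y.next)
        # x.next is the text node inside <th>, i.e. the menu name
        results[x.next] = results_li
    
    print results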
    

    Output:

    {
        u'menu2': [u'Some data2', u'Foo2', u'Foo3', u'Some data3'], 
        u'menu1': [u'Some data1', u'Foo1']
    }
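
The output above keeps only the first text node of each <li>, so the link text (Bar1, Bar2, ...) is missing compared with the dictionary asked for in the question. A minimal sketch of one way to include it, assuming BeautifulSoup 4 (the bs4 package) on Python 3 rather than the BeautifulSoup 3 module used above; the function name table_to_dict is only illustrative:

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed


def table_to_dict(html):
    """Map each <th> label to the list of <li> texts found in the same row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # first table on the page, or None
    results = {}
    if table is None:
        return results
    for row in table.find_all("tr"):
        th = row.find("th")
        td = row.find("td")
        if th is None or td is None:
            continue  # skip rows that do not have the th/td layout
        # get_text(" ", strip=True) joins every text piece inside the <li>,
        # including the link text, with a single space
        items = [li.get_text(" ", strip=True) for li in td.find_all("li")]
        results[th.get_text(strip=True)] = items
    return results


sample = """<table>
  <tr>
    <th width="100">menu1</th>
    <td><ul class="classno1">
      <li>Some data1</li>
      <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
    </ul></td>
  </tr>
</table>"""

print(table_to_dict(sample))             # {'menu1': ['Some data1', 'Foo1 Bar1']}
print(table_to_dict("<table></table>"))  # {}
```

Looking up the <th> and <td> per row avoids relying on nextSibling.nextSibling, which depends on the whitespace between the tags, and an empty table simply yields an empty dictionary.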
    
