Meh

Reputation: 607

Beautiful soup, html table parsing

I am currently having a bit of an issue trying to parse a table into an array.

I have a simple table (HERE) that I need to parse with BS4, putting the cell contents into an array. What makes this difficult is that the cells don't contain text; they contain images whose titles are "Confirm" or "Site" (this is just user-rights stuff). [I am skipping row one, which contains the checkboxes; those I can extract without problems.]

If you look at the fiddle above, all I need to do is to parse it in such a way that the resulting array becomes:

Array1[0] = User1,Confirm,Confirm,Site,Confirm
Array1[1] = User2,Confirm,Confirm,Confirm,Confirm
Array1[2] = User3,Confirm,Confirm,Confirm,Confirm
Array1[3] = User4,Confirm,Site,Site,Confirm

I can then do as I please with it. Another complication is that the number of rows will sometimes vary, so the script should adapt to this and build the array from however many rows the table has.

At the moment StackOverflow is my only hope. I have spent the last 10 hours trying to do this myself with little to no success, and frankly I have lost hope. The closest I got was extracting the enclosed tags, but I could not parse any further for some reason; perhaps it's a nesting limitation in bs4? Could anyone have a look, please, and see if they can find a way of doing this? Or at least explain how to get there?

Variable explanation: rightml is the soup for the table.

allusers = []
rows = rightml.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        if (td.find(title="Group")) or (td.find(title="User")):
            text = ''.join(td.text.strip())
            allusers.append(text)
print allusers

gifrights = []

rows7 = rightml.findAll('td')
#print rows7
for tr7 in rows:
    cols7 = tr7.findAll('img')
    for td7 in cols7:
        if (td7.find(title="Confirm")) or (td7.find(title="Site")):
            text = ''.join(td7.text.strip())
            text2 = text.split(' ')
            print text2
            gifrights.append(text2)

I could be WAY off with this code, but I gave it the ol' college try.

Upvotes: 4

Views: 4085

Answers (2)

Nathan Villaescusa

Reputation: 17659

Would something like this work?

rows = soup.find('tbody').findAll('tr')

for row in rows:
    cells = row.findAll('td')

    output = []

    for i, cell in enumerate(cells):
        if i == 0:
            output.append(cell.text.strip())
        elif cell.find('img'):
            output.append(cell.find('img')['title'])
        elif cell.find('input'):
            output.append(cell.find('input')['value'])
    print output

This outputs the following:

[u'Logged-in users', u'True', u'True', u'True', u'True']
[u'User 1', u'Confirm', u'Confirm', u'Site', u'Confirm']
[u'User 2', u'Confirm', u'Confirm', u'Confirm', u'Confirm']
[u'User 3', u'Confirm', u'Confirm', u'Confirm', u'Confirm']
[u'User 4', u'Confirm', u'Site', u'Site', u'Confirm']
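
For anyone who wants to run this without the fiddle, here is a self-contained Python 3 sketch of the same loop. The HTML below is only an assumed stand-in for the table (three columns instead of five, made-up src values), and parse_rights is a hypothetical helper name, so adjust both to the real page.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the table in the fiddle (not the original markup).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr>
      <td>Logged-in users</td>
      <td><input type="checkbox" value="True"></td>
      <td><input type="checkbox" value="True"></td>
    </tr>
    <tr>
      <td><img title="User" src="u.gif"> User 1</td>
      <td><img title="Confirm" src="c.gif"></td>
      <td><img title="Site" src="s.gif"></td>
    </tr>
    <tr>
      <td><img title="User" src="u.gif"> User 2</td>
      <td><img title="Confirm" src="c.gif"></td>
      <td><img title="Confirm" src="c.gif"></td>
    </tr>
  </tbody>
</table>
"""


def parse_rights(html):
    """Return one list per row: first-cell text, then img titles or input values."""
    soup = BeautifulSoup(html, "html.parser")
    table = []
    for row in soup.find("tbody").find_all("tr"):
        output = []
        for i, cell in enumerate(row.find_all("td")):
            img = cell.find("img")
            box = cell.find("input")
            if i == 0:
                # First column: keep the visible text (the user name).
                output.append(cell.get_text(strip=True))
            elif img is not None:
                # Rights columns: the information lives in the image title.
                output.append(img["title"])
            elif box is not None:
                # Checkbox row: fall back to the input's value attribute.
                output.append(box["value"])
        table.append(output)
    return table


rights = parse_rights(SAMPLE_HTML)
for row in rights:
    print(row)
```

Because the function walks whatever tr elements it finds, a table with more or fewer rows needs no changes.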

Upvotes: 6

kreativitea

Reputation: 1791

I think it's faster to use a list comprehension over the rows, like this:

rows = soup.find('tbody').findAll('tr')

for i in rows[1:]: # the first row is thrown out
    print [j['title'] for j in i.findAll('img')]

Which gives you:

['User', 'Confirm', 'Confirm', 'Site', 'Confirm']
['User', 'Confirm', 'Confirm', 'Confirm', 'Confirm']
['User', 'Confirm', 'Confirm', 'Confirm', 'Confirm']
['User', 'Confirm', 'Site', 'Site', 'Confirm']

You can cut out even more steps using nested list comprehension:

# superpythonic
[[j['title'] for j in i.findAll('img')] for i in rows[1:]]

# all together now, but not so pythonic
[[j['title'] for j in i.findAll('img')] for i in soup.find('tbody').findAll('tr')[1:]]

You don't really need a User#, since the user# is the index number + 1.

[[j['title'] for j in i.findAll('img') if j['title'] != 'User'] for i in rows[1:]]

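But, if you -must- have one (with users bound to the list of lists built by the comprehension above)...

for i in xrange(len(users)):
    # put the label first, matching the layout asked for in the question
    users[i].insert(0, "User " + str(i+1))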

But, if you were to insist on doing this, I would use a namedtuple as the data structure instead of a list.

from collections import namedtuple
# make these actual non-obfuscated names, not column numbers
User = namedtuple('User', 'num col_1 col_2 col_3 col_4')

And then, once you have a User instance for, say, User 1 bound to user, you can access its fields by name:

>>> user.num
1
>>> user.col_1
'Confirm'
>>> user.col_2
'Confirm'
>>> user.col_3
'Site'
>>> user.col_4
'Confirm'
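
As a rough Python 3 sketch of how those instances could be built: the rights list below is assumed sample data in the shape the comprehension produces once the 'User' title is filtered out, and the col_N field names are still placeholders.

```python
from collections import namedtuple

# Field names are placeholders; use names that describe the actual columns.
User = namedtuple("User", "num col_1 col_2 col_3 col_4")

# Assumed sample data: one list of image titles per table row.
rights = [
    ["Confirm", "Confirm", "Site", "Confirm"],
    ["Confirm", "Site", "Site", "Confirm"],
]

# The user number is the row index + 1, so enumerate from 1.
users = [User(num, *row) for num, row in enumerate(rights, start=1)]

print(users[0].col_3)
```

A namedtuple is still a tuple, so users[0][3] and users[0].col_3 refer to the same value.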

Upvotes: 4
