Reputation: 35
I have the following sample HTML table from an HTML page.
<table id="fullRecordTable" valign="bottom" cellpadding="3" cellspacing="0" class="yellow" width="100%" summary="Vollanzeige des Suchergebnises">
...
<tr>
<td width="25%" class='yellow'>
<strong>Sachbegriff</strong>
</td>
<td class='yellow'>
Messung
</td>
</tr>
<tr>
...
</tr>
<tr>
...
</tr>
<tr>
...
</tr>
<tr>
<td width="25%" class='yellow'>
<strong>DDC-Notation</strong>
</td>
<td class='yellow'>
530.8<br/>T1--0287<br/>542.3
</td>
</tr>
I am trying to print "DDC-Notation" and then the next three values: "530.8", "T1--0287", "542.3".
My code is:
soup = BeautifulSoup(data, "html.parser")
talbes = soup.findAll('table', id='fullRecordTable').find_all('tr')
for table in talbes:
    tds = table.find_all('strong')
    print tds.text
But it doesn't work, not even for the first one.
P.S. Sorry, this is my first post. If I haven't explained my problem clearly, I'll try again.
Upvotes: 1
Views: 2691
Reputation: 21643
Life is much easier if you use an interactive environment to debug your code because you can poke around looking for what you need.
In this case, I knew that you wanted to find a certain string. I looked for that in a direct way.
Having found it, I sought its grandparent, the td element, and then the sibling of that td, another td.
I made that into a variable called td, just for convenience, because I wasn't sure how I would dig out the pieces you want.
Eventually I found that the children property gives an iterator over the items you need. It's merely a matter of stripping out HTML tags, newlines and blanks.
>>> import bs4
>>> HTML = open('temp.htm').read()
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> strong = soup.find_all(string='DDC-Notation')
>>> strong
['DDC-Notation']
>>> strong[0].findParent()
<strong>DDC-Notation</strong>
>>> strong[0].findParent().findParent()
<td class="yellow" width="25%">
<strong>DDC-Notation</strong>
</td>
>>> strong[0].findParent().findParent().findNextSibling()
<td class="yellow">
530.8<br/>T1--0287<br/>542.3
</td>
>>> td = strong[0].findParent().findParent().findNextSibling()
>>> td
<td class="yellow">
530.8<br/>T1--0287<br/>542.3
</td>
>>> td.children
<list_iterator object at 0x00000000035993C8>
>>> list(td.children)
['\n 530.8', <br/>, 'T1--0287', <br/>, '542.3\n ']
Edit: It occurred to me this morning that this answer might be more useful to you if I offered a consolidated script. In writing it I discovered (once again) that there's a little bit more to processing the items in a list like that than might appear to be the case.
When Python outputs most things it converts them to strings for us automatically. But when you process the items in a list of HTML elements, they are elements, not strings. If you want to process them as strings you must convert them first, hence the need for the line `item = str(item).strip()`. It converts each element to a string and discards the surrounding whitespace.
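import bs4

# 'temp.htm' is a local copy of the page
HTML = open('temp.htm').read()
soup = bs4.BeautifulSoup(HTML, 'lxml')
# Find the label text, climb to its td, then step to the neighbouring td
strong = soup.find_all(string='DDC-Notation')
td = strong[0].findParent().findParent().findNextSibling()
for item in list(td.children):
    # Convert each child (text or tag) to a string and trim whitespace
    item = str(item).strip()
    # Skip the <br/> tags
    if item.startswith('<'):
        continue
    print(item)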
Output:
530.8
T1--0287
542.3
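For what it's worth, here is a rough sketch of the same idea wrapped in a function, in case you want to look up other labels too. The helper name values_for_label is made up for this example, the HTML is a trimmed copy of the rows from your question rather than the real page, and I've used the built-in 'html.parser' so it runs without lxml. It relies on stripped_strings, which skips the br tags and trims whitespace, so no str() conversion is needed.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the question's table (assumption: the real page
# has more rows, but the label/value layout is the same).
HTML = """
<table id="fullRecordTable">
<tr>
<td width="25%" class='yellow'>
<strong>Sachbegriff</strong>
</td>
<td class='yellow'>
Messung
</td>
</tr>
<tr>
<td width="25%" class='yellow'>
<strong>DDC-Notation</strong>
</td>
<td class='yellow'>
530.8<br/>T1--0287<br/>542.3
</td>
</tr>
</table>
"""


def values_for_label(html, label):
    """Hypothetical helper: return the text pieces in the cell next to `label`."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns one Tag (or None); find_all() returns a list-like ResultSet
    table = soup.find("table", id="fullRecordTable")
    if table is None:
        return []
    for row in table.find_all("tr"):
        strong = row.find("strong")
        if strong is not None and strong.get_text(strip=True) == label:
            # Climb to the label's td, then step to the next td in the row
            value_cell = strong.find_parent("td").find_next_sibling("td")
            if value_cell is None:
                return []
            # stripped_strings yields only text nodes, already trimmed
            return list(value_cell.stripped_strings)
    return []


values = values_for_label(HTML, "DDC-Notation")
print(values)
```

As a side note, the code in the question fails earlier than the loop: findAll returns a ResultSet, which has no find_all method, and the result of find_all('strong') is likewise a list-like object without a text attribute.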
Upvotes: 1