Python - BeautifulSoup find strong and the next value of td class

Question

I have the following sample HTML table from an HTML page.



...

  
    ...
  
    ...
  
    ...
  
I am trying to print "DDC-Notation" and then the next three values: "530.8", "T1--0287", "542.3".
My code is: 
soup = BeautifulSoup(data, "html.parser")

talbes = soup.findAll('table', id='fullRecordTable').find_all('tr')

for table in talbes:
    tds = table.find_all('strong')  
    print tds.text
But it doesn't work for the first one.
P.S. Sorry, this is my first post. If I haven't explained my problem clearly, I'll try again.

    
Sachbegriff     Messung
DDC-Notation    530.8
                T1--0287
                542.3

Bill Bell · Accepted Answer

Life is much easier if you use an interactive environment to debug your code because you can poke around looking for what you need.
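As an illustration of that kind of poking around, here is a minimal sketch of what inspecting the intermediate values of a chained call can reveal. The markup is a hypothetical one-row stand-in (only the id fullRecordTable comes from the question's code), and I use the built-in html.parser so nothing beyond bs4 is needed.

```python
import bs4

# Hypothetical minimal table; only the id is taken from the question's code.
soup = bs4.BeautifulSoup(
    '<table id="fullRecordTable"><tr>'
    "<td><strong>DDC-Notation</strong></td>"
    "</tr></table>",
    "html.parser",
)

# find_all returns a list-like ResultSet, not a single Tag.
tables = soup.find_all("table", id="fullRecordTable")
print(type(tables).__name__)

# Calling a Tag method on the whole result set raises AttributeError.
try:
    tables.find_all("tr")
    chained_ok = True
except AttributeError:
    chained_ok = False

# Selecting one element first (here with find) makes the chain work.
rows = soup.find("table", id="fullRecordTable").find_all("tr")
labels = [strong.get_text() for row in rows for strong in row.find_all("strong")]
print(chained_ok, labels)
```

Checking type() at each step like this tends to show quickly whether you are holding one element or a collection of them.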

In this case, I knew that you wanted to find a certain string. I looked for that in a direct way.

Having found it, I sought its grandparent, the td element and then the sibling of that td, another td.

I assigned that to a variable called td, just for convenience, because I wasn't sure how I would dig out the pieces you want.

Eventually I found that the children property yields the items you need; wrapping it in list() shows them. After that it's merely a matter of dropping the HTML tags and stripping newlines and blanks.

>>> import bs4
>>> HTML = open('temp.htm').read()
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> strong = soup.find_all(string='DDC-Notation')
>>> strong
['DDC-Notation']
>>> strong[0].findParent()
DDC-Notation
>>> strong[0].findParent().findParent()

DDC-Notation

>>> strong[0].findParent().findParent().findNextSibling()

      530.8
T1--0287
542.3
    
>>> td = strong[0].findParent().findParent().findNextSibling()
>>> td

      530.8
T1--0287
542.3
    
>>> td.children

>>> list(td.children)
['
      530.8', 
, 'T1--0287', 
, '542.3
    ']
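If you don't have the page saved locally, the same navigation can be tried on an inline snippet. This is only a sketch: the markup below is my guess at the shape of the table (a label cell containing a strong element, followed by a cell whose values are separated by br tags), and the real page's attributes and nesting may differ. It uses the underscore method names find_parent and find_next_sibling and the built-in html.parser.

```python
import bs4

# Hypothetical markup: the real page's attributes and nesting may differ.
SAMPLE = """
<table id="fullRecordTable">
  <tr>
    <td><strong>Sachbegriff</strong></td>
    <td>Messung</td>
  </tr>
  <tr>
    <td><strong>DDC-Notation</strong></td>
    <td>
      530.8<br/>T1--0287<br/>542.3
    </td>
  </tr>
</table>
"""

soup = bs4.BeautifulSoup(SAMPLE, "html.parser")

# Locate the text node, then walk up to the enclosing <td>.
label = soup.find(string="DDC-Notation")
label_cell = label.find_parent("td")

# The values live in the next <td> of the same row.
value_cell = label_cell.find_next_sibling("td")

# Keep only the text nodes (NavigableString); tags such as <br/> are skipped.
values = [
    child.strip()
    for child in value_cell.children
    if isinstance(child, bs4.NavigableString) and child.strip()
]
print(values)
```

Asking find_parent and find_next_sibling for "td" by name avoids counting parent levels by hand, which helps if the label turns out to be nested more deeply than expected.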

Edit: It occurred to me this morning that this answer might be more useful to you if I offered a consolidated script. In writing it I discovered (once again) that there's a little bit more to processing the items in a list like that than might appear to be the case.

When Python prints most things, it converts them to strings for us automatically. But when you process the items in a list of HTML elements, they will be elements, not strings, and if you want to process them as strings you must convert them first, hence the need for the line `item = str(item).strip()`. It converts each element to a string and discards leading and trailing whitespace.

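import bs4

# temp.htm is a saved copy of the page from the question.
HTML = open('temp.htm').read()
soup = bs4.BeautifulSoup(HTML, 'lxml')

# Find the text node, climb two levels to its td, then step to the next td.
strong = soup.find_all(string='DDC-Notation')
td = strong[0].findParent().findParent().findNextSibling()

for item in list(td.children):
    # Children are elements or text nodes; convert to str before testing.
    item = str(item).strip()
    if item.startswith('<'):
        continue  # skip tags, keep only the text
    print(item)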

Output:

530.8
T1--0287
542.3
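As an alternative to converting each child with str() and skipping anything that starts with '<', Beautiful Soup can hand back just the text pieces. The following is a sketch under the assumption that the value cell looks like the hypothetical snippet below (values separated by br tags); it is a swapped-in technique, not the loop used above.

```python
import bs4

# Hypothetical stand-in for the value cell; the real page may differ.
cell_html = "<td>\n      530.8<br/>T1--0287<br/>542.3\n    </td>"
td = bs4.BeautifulSoup(cell_html, "html.parser").td

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that contain only whitespace.
values = list(td.stripped_strings)
for value in values:
    print(value)

# get_text can join the same pieces into a single string instead.
joined = td.get_text(separator="\n", strip=True)
```

Both calls ignore the tags themselves, so no check for a leading '<' is needed.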
