zero

Reputation: 1213

Parsing HTML data in a table using lxml

I'm still learning to code, and a friend of mine told me to use BeautifulSoup. After running into some problems, I think I should use lxml instead of BeautifulSoup because it's even better. I'm hoping someone can give me a hint on how to scrape the text I'm looking for. What I want is to find a table with the following rows and extract the data in the field "General Information".

BTW, I also tried to get the table's elements with pandas, but while pandas is really great, it doesn't help in every case. I think I have to scrape some table elements individually, since I don't want the entire table. The structure of my HTML table:

<table border="" width="100%">
<tbody><tr valign="top"><td width="50%">
<h3 align="center">item 1</h3>
<ul>
<li><a href="/link.html">name <b>mike/b></a><b>
</b>
<hr width="50%">
</li><li><a href="/link.html">name <b>john</b></a>,
<a href="link.html">name</a>fred</li></ul>


</td><td>
<h3 align="center"> General Information </h3><p></p><ul>
<li>Type of company
</li><li>foundet <a href="/calendar/dayoffoundation.html">10 December</a> <a href="/foundet.html">1900</a>
</li><li> category 1
</li><li>  category 2 
</li><li>Country: <a href="/country/california.html">california</a></li><li>
</li><li> Town: <a href="/country/sggf.html">san francisco</a>
</li><li>Official Web Site: <a href="https://www.demo-company.net/">https://www.demo-company/</a>
</li><li>Mailing Address: 
</li><li>Telephone: 3453455
</li><li>Fax: 433532
</li></ul></td></tr></tbody></table>

Here td stands for "table data", which is where the data is stored as text.

How do I scrape the website with lxml and get the following results?

['General Information', 'foundet', 'category 1', 'category 2', 'country', 'and so forth'] — note: everything else on the page is not needed!

I normally use this pattern, which is pretty helpful: all that's left to do is select the correct elements using BeautifulSoup. The first thing to do is to find the table.

I normally use the find_all() method, which returns a list of all elements that satisfy the requirements we pass to it. We then select the table we need from that list:

table = soup.find_all('table')[n]  # replace n with the index of the table you need
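As a minimal sketch of that pattern (using a shortened stand-in for the real page markup; since this sample contains only one table, the index is 0):

```python
from bs4 import BeautifulSoup

# shortened stand-in for the real page markup
html = '''
<table><tr><td><h3>item 1</h3></td>
<td><h3>General Information</h3><ul><li>Type of company</li></ul></td></tr></table>
'''
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table')  # list of every <table> on the page
table = tables[0]                # pick the one you need by index
print(table.find_all('h3')[1].text)  # -> General Information
```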

Upvotes: 1

Views: 111

Answers (1)

QHarr

Reputation: 84465

You can use lxml with bs4. Just add nth-child/nth-of-type to target the right td, then reach down for the h3 and the li elements (there are other ways, such as the adjacent sibling combinator):

from bs4 import BeautifulSoup as bs

html = '''
<table border="" width="100%">
<tbody><tr valign="top"><td width="50%">
<h3 align="center">item 1</h3>
<ul>
<li><a href="/link.html">name <b>mike/b></a><b>
</b>
<hr width="50%">
</li><li><a href="/link.html">name <b>john</b></a>,
<a href="link.html">name</a>fred</li></ul>
</td><td>
<h3 align="center"> General Information </h3><p></p><ul>
<li>Type of company
</li><li>foundet <a href="/calendar/dayoffoundation.html">10 December</a> <a href="/foundet.html">1900</a>
</li><li> category 1
</li><li>  category 2 
</li><li>Country: <a href="/country/california.html">california</a></li><li>
</li><li> Town: <a href="/country/sggf.html">san francisco</a>
</li><li>Official Web Site: <a href="https://www.demo-company.net/">https://www.demo-company/</a>
</li><li>Mailing Address: 
</li><li>Telephone: 3453455
</li><li>Fax: 433532
</li></ul></td></tr></tbody></table>
'''
soup = bs(html, 'lxml')

print([i.text.strip() for i in soup.select('td:nth-child(2) > h3, td:nth-child(2) > ul > li')])
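Since the question title asks about lxml specifically, the same idea also works with lxml alone via XPath — a sketch on a shortened stand-in for the markup above, not part of the original answer:

```python
from lxml import html as lxml_html

# shortened stand-in for the sample markup
html = '''<table border="" width="100%">
<tbody><tr valign="top"><td width="50%"><h3>item 1</h3></td><td>
<h3 align="center"> General Information </h3><ul>
<li>Type of company</li><li>Country: <a href="/country/california.html">california</a></li>
</ul></td></tr></tbody></table>'''

tree = lxml_html.fromstring(html)
# second td in the row, then its h3 text and li descendants (document order)
items = tree.xpath('//tr/td[2]/h3/text() | //tr/td[2]/ul/li')
texts = [i.strip() if isinstance(i, str) else i.text_content().strip() for i in items]
print(texts)  # -> ['General Information', 'Type of company', 'Country: california']
```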

If you know the header in advance (which it seems you do), you can use a more targeted approach with :contains (:-soup-contains in recent versions):

from bs4 import BeautifulSoup as bs
import pandas as pd

html = '''
<table border="" width="100%">
<tbody><tr valign="top"><td width="50%">
<h3 align="center">item 1</h3>
<ul>
<li><a href="/link.html">name <b>mike/b></a><b>
</b>
<hr width="50%">
</li><li><a href="/link.html">name <b>john</b></a>,
<a href="link.html">name</a>fred</li></ul>
</td><td>
<h3 align="center"> General Information </h3><p></p><ul>
<li>Type of company
</li><li>foundet <a href="/calendar/dayoffoundation.html">10 December</a> <a href="/foundet.html">1900</a>
</li><li> category 1
</li><li>  category 2 
</li><li>Country: <a href="/country/california.html">california</a></li><li>
</li><li> Town: <a href="/country/sggf.html">san francisco</a>
</li><li>Official Web Site: <a href="https://www.demo-company.net/">https://www.demo-company/</a>
</li><li>Mailing Address: 
</li><li>Telephone: 3453455
</li><li>Fax: 433532
</li></ul></td></tr></tbody></table>
'''
soup = bs(html, 'lxml')

df = pd.DataFrame(
    [i.text.strip() for i in soup.select('td:has(h3:contains("General Information")) > ul > li')],
    columns=['General Information']
)
print(df)
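If you later want to split entries like "Telephone: 3453455" into label/value pairs, pandas' Series.str.split can do that — a sketch on a few hypothetical rows, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(
    ['Country: california', 'Telephone: 3453455', 'Fax: 433532'],
    columns=['General Information']
)
# split each entry on the first ':' into two columns
parts = df['General Information'].str.split(':', n=1, expand=True)
parts.columns = ['field', 'value']
parts['value'] = parts['value'].str.strip()
print(parts)
```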

Upvotes: 1
