zero

Reputation: 1213

Parsing HTML data in a table using lxml

I'm still learning to code, and a friend of mine told me to use BeautifulSoup. After running into some problems, I think I should use lxml instead of BeautifulSoup because it's even better. I'm hoping someone can give me a hint on how to scrape the text I'm looking for. What I want is to find a table with the following rows and extract the data in the field "General Information".

BTW, I also tried to get the table's elements with pandas, but while pandas is really great, it doesn't help in every case. I think I have to scrape some table elements individually, since I don't want the entire table. The structure of my HTML table:

<table border="" width="100%">
<tbody><tr valign="top"><td width="50%">
<h3 align="center">item 1</h3>
<ul>
<li><a href="/link.html">name <b>mike/b></a><b>
</b>
<hr width="50%">
</li><li><a href="/link.html">name <b>john</b></a>,
<a href="link.html">name</a>fred</li></ul>


</td><td>
<h3 align="center"> General Information </h3><p></p><ul>
<li>Type of company
</li><li>foundet <a href="/calendar/dayoffoundation.html">10 December</a> <a href="/foundet.html">1900</a>
</li><li> category 1
</li><li>  category 2 
</li><li>Country: <a href="/country/california.html">california</a></li><li>
</li><li> Town: <a href="/country/sggf.html">san francisco</a>
</li><li>Official Web Site: <a href="https://www.demo-company.net/">https://www.demo-company/</a>
</li><li>Mailing Address: 
</li><li>Telephone: 3453455
</li><li>Fax: 433532
</li></ul></td></tr></tbody></table>

Here td stands for "table data", which is where the data is stored as text.

How do I scrape the website with lxml and get the following results?

['General Information', 'foundet', 'category 1', 'category 2', 'country', 'and so forth'] — note: everything else on the page is not needed!

I normally use this pattern, which is pretty helpful: all that's left to do is select the correct elements using BeautifulSoup. The first thing to do is to find the table.

I normally use the find_all() method, which returns a list of all elements that satisfy the requirements we pass to it. We then select the table we need from that list:

table = soup.find_all('table')[n]  # replace n with the index of the table you need
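As a minimal sketch of that pattern (using a shortened stand-in for the real page markup; since this sample contains only one table, the index is 0):

```python
from bs4 import BeautifulSoup

# shortened stand-in for the real page markup
html = '''
<table><tr><td><h3>item 1</h3></td>
<td><h3>General Information</h3><ul><li>Type of company</li></ul></td></tr></table>
'''
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table')  # list of every <table> on the page
table = tables[0]                # pick the one you need by index
print(table.find_all('h3')[1].text)  # -> General Information
```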

Upvotes: 1

Views: 111

Answers (1)

QHarr

Reputation: 84465

You can use lxml with bs4. Just add nth-child/nth-of-type to target the right td, then reach down for the h3 and the li elements (there are other ways, such as the adjacent sibling combinator):

from bs4 import BeautifulSoup as bs

html = '''
<table border="" width="100%">
<tbody><tr valign="top"><td width="50%">
<h3 align="center">item 1</h3>
<ul>
<li><a href="/link.html">name <b>mike/b></a><b>
</b>
<hr width="50%">
</li><li><a href="/link.html">name <b>john</b></a>,
<a href="link.html">name</a>fred</li></ul>
</td><td>
<h3 align="center"> General Information </h3><p></p><ul>
<li>Type of company
</li><li>foundet <a href="/calendar/dayoffoundation.html">10 December</a> <a href="/foundet.html">1900</a>
</li><li> category 1
</li><li>  category 2 
</li><li>Country: <a href="/country/california.html">california</a></li><li>
</li><li> Town: <a href="/country/sggf.html">san francisco</a>
</li><li>Official Web Site: <a href="https://www.demo-company.net/">https://www.demo-company/</a>
</li><li>Mailing Address: 
</li><li>Telephone: 3453455
</li><li>Fax: 433532
</li></ul></td></tr></tbody></table>
'''
soup = bs(html, 'lxml')

print([i.text.strip() for i in soup.select('td:nth-child(2) > h3, td:nth-child(2) > ul > li')])
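Since the question title asks about lxml specifically, the same idea also works with lxml alone via XPath — a sketch on a shortened stand-in for the markup above, not part of the original answer:

```python
from lxml import html as lxml_html

# shortened stand-in for the sample markup
html = '''<table border="" width="100%">
<tbody><tr valign="top"><td width="50%"><h3>item 1</h3></td><td>
<h3 align="center"> General Information </h3><ul>
<li>Type of company</li><li>Country: <a href="/country/california.html">california</a></li>
</ul></td></tr></tbody></table>'''

tree = lxml_html.fromstring(html)
# second td in the row, then its h3 text and li descendants (document order)
items = tree.xpath('//tr/td[2]/h3/text() | //tr/td[2]/ul/li')
texts = [i.strip() if isinstance(i, str) else i.text_content().strip() for i in items]
print(texts)  # -> ['General Information', 'Type of company', 'Country: california']
```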

If you know the header in advance (which it seems you do), you can use a more targeted approach with :contains (:-soup-contains in recent versions):

from bs4 import BeautifulSoup as bs
import pandas as pd

html = '''
<table border="" width="100%">
<tbody><tr valign="top"><td width="50%">
<h3 align="center">item 1</h3>
<ul>
<li><a href="/link.html">name <b>mike/b></a><b>
</b>
<hr width="50%">
</li><li><a href="/link.html">name <b>john</b></a>,
<a href="link.html">name</a>fred</li></ul>
</td><td>
<h3 align="center"> General Information </h3><p></p><ul>
<li>Type of company
</li><li>foundet <a href="/calendar/dayoffoundation.html">10 December</a> <a href="/foundet.html">1900</a>
</li><li> category 1
</li><li>  category 2 
</li><li>Country: <a href="/country/california.html">california</a></li><li>
</li><li> Town: <a href="/country/sggf.html">san francisco</a>
</li><li>Official Web Site: <a href="https://www.demo-company.net/">https://www.demo-company/</a>
</li><li>Mailing Address: 
</li><li>Telephone: 3453455
</li><li>Fax: 433532
</li></ul></td></tr></tbody></table>
'''
soup = bs(html, 'lxml')

df = pd.DataFrame(
    [i.text.strip() for i in soup.select('td:has(h3:contains("General Information")) > ul > li')],
    columns=['General Information']
)
print(df)
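If you later want to split entries like "Telephone: 3453455" into label/value pairs, pandas' Series.str.split can do that — a sketch on a few hypothetical rows, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(
    ['Country: california', 'Telephone: 3453455', 'Fax: 433532'],
    columns=['General Information']
)
# split each entry on the first ':' into two columns
parts = df['General Information'].str.split(':', n=1, expand=True)
parts.columns = ['field', 'value']
parts['value'] = parts['value'].str.strip()
print(parts)
```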

Upvotes: 1
