Yusuf Cattaneo

Reputation: 1

How to extract Table contents from an HTML page using BeautifulSoup in Python?

I am trying to scrape the following URL, and so far I have been able to use the code below to extract the ul elements.

from bs4 import BeautifulSoup
import urllib
import csv
import requests
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
print(page_content.prettify())
page_content.ul

However, my goal is to extract the information contained within the table into a csv file. How can I go about doing this judging from my current code?

Upvotes: 0

Views: 1680

Answers (3)

trotta

Reputation: 1226

Although I think that KunduK's answer provides an elegant solution using pandas, I would like to give you another approach, since you explicitly asked how to go on from your current code (which uses the csv module and BeautifulSoup).

from bs4 import BeautifulSoup
import csv
import requests

new_file = '/path/to/new/file.csv'
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
table = page_content.find('table')

for i, tr in enumerate(table.find_all('tr')):
    row = []
    # include 'th' as well, since the header row uses th cells rather than td
    for cell in tr.find_all(['th', 'td']):
        row.append(cell.get_text(strip=True))
    if i == 0:  # first row: write the header
        with open(new_file, 'w', newline='') as f:
            csv.writer(f).writerow(row)
    else:  # every other row: append as data
        with open(new_file, 'a', newline='') as f:
            csv.writer(f).writerow(row)

As you can see, we first fetch the whole table and then iterate over its rows (the tr elements) and over the cells within each row. In the first round of the iteration, we use the row's contents as the header of our csv file. Subsequently, we append all following rows to the csv file as data.
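The same row-by-row pattern can be checked without hitting the network. This is a minimal sketch using a small inline HTML table as a hypothetical stand-in for the real page, and an in-memory buffer instead of a file on disk:

```python
from bs4 import BeautifulSoup
import csv
import io

# Hypothetical stand-in for the real page's table
html = """
<table>
  <tr><th>Title</th><th>Author</th></tr>
  <tr><td>Book A</td><td>Smith</td></tr>
  <tr><td>Book B</td><td>Jones</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
buffer = io.StringIO()  # in-memory stand-in for the csv file
writer = csv.writer(buffer)
for tr in soup.find('table').find_all('tr'):
    # 'th' covers the header row, 'td' the data rows
    writer.writerow([cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])])
print(buffer.getvalue())
```

Swapping the buffer for `open(new_file, 'w', newline='')` gives the file-based version above.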

Upvotes: 1

SIM

Reputation: 22440

Slightly cleaner approach using a list comprehension:

import csv
import requests
from bs4 import BeautifulSoup

page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'

page_response = requests.get(page_link)
page_content = BeautifulSoup(page_response.content, "html.parser")

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for items in page_content.find('table').find_all('tr'):
        data = [item.get_text(strip=True) for item in items.find_all(['th','td'])]
        print(data)
        writer.writerow(data)

Upvotes: 1

KunduK

Reputation: 33384

You can use the python pandas library to read the table and export it to csv, which is the easiest way to do this.

import pandas as pd
tables=pd.read_html("https://repo.vse.gmu.edu/ait/AIT580/580books.html")
tables[0].to_csv("output.csv",index=False)

To install pandas (along with lxml, which read_html needs to parse the HTML), just use

pip install pandas lxml
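Note that read_html returns a list of DataFrames, one per table found on the page, which is why the snippet above indexes tables[0]. A minimal offline sketch of that behavior, using an inline HTML string as a hypothetical stand-in for the live URL:

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-in for the page; read_html accepts a file-like object
html = "<table><tr><th>Title</th><th>Author</th></tr><tr><td>Book A</td><td>Smith</td></tr></table>"
tables = pd.read_html(StringIO(html))  # list of DataFrames, one per <table>
print(tables[0])  # the <th> row becomes the column header
```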

Upvotes: 3
