Reputation: 1
I am trying to scrape the following URL, and so far I have been able to use the code below to extract the ul elements.
from bs4 import BeautifulSoup
import urllib
import csv
import requests
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
print(page_content.prettify())
page_content.ul
However, my goal is to extract the information contained within the table into a CSV file. How can I go about doing this, building on my current code?
Upvotes: 0
Views: 1680
Reputation: 1226
Although I think that KunduK's answer provides an elegant solution using pandas, I would like to give you another approach, since you explicitly asked how to go on from your current code (which uses the csv module and BeautifulSoup).
from bs4 import BeautifulSoup
import csv
import requests

new_file = '/path/to/new/file.csv'
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")

table = page_content.find('table')
for i, tr in enumerate(table.findAll('tr')):
    row = []
    for td in tr.findAll('td'):
        row.append(td.text)
    if i == 0:  # first row: write the header
        with open(new_file, 'w', newline='') as f:
            writer = csv.DictWriter(f, row)
            writer.writeheader()
    else:  # remaining rows: append the data
        with open(new_file, 'a', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(row)
As you can see, we first fetch the whole table and then iterate over the tr elements, and within each row over the td elements. In the first round of the iteration (the first tr), we use the cell contents as the header of our CSV file. Subsequently, we write all remaining rows to the CSV file.
Upvotes: 1
Reputation: 22440
Slightly cleaner approach using list comprehensions:
import csv
import requests
from bs4 import BeautifulSoup
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link)
page_content = BeautifulSoup(page_response.content, "html.parser")
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for items in page_content.find('table').find_all('tr'):
        data = [item.get_text(strip=True) for item in items.find_all(['th', 'td'])]
        print(data)
        writer.writerow(data)
Upvotes: 1
Reputation: 33384
You can use the Python pandas library to export the data to CSV, which is the easiest way to do this.
import pandas as pd

tables = pd.read_html("https://repo.vse.gmu.edu/ait/AIT580/580books.html")
tables[0].to_csv("output.csv", index=False)
To install pandas, just use
pip install pandas
(Note that pd.read_html also needs an HTML parser installed, such as lxml, html5lib, or BeautifulSoup.)
Upvotes: 3