Removing HTML from Web Scrape

Question

I'm new to Python and trying to learn web scraping. I can't figure out how to remove the HTML from the final output.

from bs4 import BeautifulSoup
import requests
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

dataFiles = soup.find('div', class_='span-84')
# fullDates = dataFiles.findAll('a')

print(dataFiles)

When I run this I get results like:

(A block of text that includes all the HTML. I'm not sure how to post it here so that it shows up the way it does in the terminal.)

I'm looking for:

Saturday, April 11, 2020 Saturday, April 04, 2020 Saturday, March 28, 2020 Saturday, March 21, 2020

...and so on...

I've tried:

from bs4 import BeautifulSoup
import requests
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

dataFiles = soup.find('div', class_='span-84')

print(dataFiles.text.strip)

That produces:

Using just print(dataFiles.strip) or print(dataFiles.text) produces the output: None

What am I doing wrong? How can I get the data stripped of the HTML?

dabingsou · Accepted Answer

Try this. It uses the simplified_scrapy library in place of BeautifulSoup.

from simplified_scrapy import SimplifiedDoc, req, utils

url = 'http://web.mta.info/developers/turnstile.html'
html = req.get(url)
doc = SimplifiedDoc(html)
dataFiles = doc.select('div.span-84')

# Get every link inside the container
fullDates = dataFiles.selects('a')

# Split each link's text at the first comma: [weekday, rest of the date]
print([a.text.split(',', 1) for a in fullDates])

# Alternatives:
# print([(a.href, a.text) for a in fullDates])                # href and text pairs
# print([utils.absoluteUrl(url, a.href) for a in fullDates])  # absolute links
# print(dataFiles.text)                                       # all text in the div

Result:

[['Saturday', ' April 18, 2020'], ['Saturday', ' April 11, 2020'],...
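
If you would rather stay with BeautifulSoup, here is a minimal sketch of the same idea. One thing to note about the attempt in the question: `strip` is a method, so `dataFiles.text.strip` without parentheses refers to the method itself instead of calling it; `strip()` is what returns the cleaned string. The `SAMPLE_HTML` string and the `extract_dates` helper below are hypothetical stand-ins so the example runs without a network request; the real page may be structured differently, and the built-in `html.parser` is used only to avoid needing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text, so the sketch runs offline.
SAMPLE_HTML = """
<div class="span-84">
  <a href="#">Saturday, April 11, 2020</a>
  <a href="#">Saturday, April 04, 2020</a>
</div>
"""


def extract_dates(html):
    """Return the visible text of every link inside div.span-84."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="span-84")
    if container is None:
        # find() returns None when no matching element exists.
        return []
    # get_text(strip=True) drops the tags and trims surrounding whitespace.
    return [a.get_text(strip=True) for a in container.find_all("a")]


dates = extract_dates(SAMPLE_HTML)
print(" ".join(dates))
```

With the live page you would pass `response.text` from `requests.get(url)` to `extract_dates` instead of `SAMPLE_HTML`.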
