Reputation: 667
Using python, how would you go about scraping both pictures and text from a website. For example say I wanted to scrape both the pictures and the text here, what python tools/libraries would I use? Any tutorials?
Upvotes: 1
Views: 1456
Reputation: 1579
Please never use regular expressions; they are not made for parsing HTML.
Normally I use the following combination of tools.
An approach would look like this, and I hope you get the idea (the code just illustrates the concept and is untested):
import lxml.html
import requests
from cssselect import HTMLTranslator, SelectorError
from bs4 import UnicodeDammit

def scrape(url, css_selector):
    # First do the HTTP request with the requests module
    r = requests.get(url)
    html = r.content  # raw bytes, so UnicodeDammit can sniff the encoding

    # Try to detect the encoding with UnicodeDammit (ships with
    # beautifulsoup4) and parse the HTML with lxml
    try:
        doc = UnicodeDammit(html, is_html=True)
        parser = lxml.html.HTMLParser(encoding=doc.declared_html_encoding)
        dom = lxml.html.document_fromstring(html, parser=parser)
        dom.resolve_base_href()
    except Exception as e:
        print('Some error occurred while lxml tried to parse: {}'.format(e))
        return False

    # Try to extract all data that we are interested in with CSS selectors,
    # translated to XPath by cssselect
    try:
        results = dom.xpath(HTMLTranslator().css_to_xpath(css_selector))
        for e in results:
            # access elements like
            print(e.get('href'))      # access the href attribute
            print(e.text_content())   # the content as text
            # or process further
            found = e.xpath(HTMLTranslator().css_to_xpath('h3.r > a:first-child'))
    except SelectorError as e:
        print('Invalid CSS selector: {}'.format(e))
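Since the question also asks about pictures: with the same lxml setup you can collect the `src` attribute of every `img` element and resolve it against the page URL, then fetch each URL with `requests.get()` and write `r.content` to a file. A minimal sketch (the inline HTML and `base_url` here are made-up stand-ins for a real `requests` response):

```python
import lxml.html
from urllib.parse import urljoin

# Hypothetical sample page; with requests you would parse r.content instead.
html = b"""
<html><body>
  <p>Some text</p>
  <img src="/images/logo.png">
  <img src="photos/cat.jpg">
</body></html>
"""

base_url = 'http://example.com/gallery/'  # made-up page URL
dom = lxml.html.document_fromstring(html)

# Collect absolute image URLs; to save a picture you would do
# requests.get(u) on each and write r.content to disk.
image_urls = [urljoin(base_url, img.get('src'))
              for img in dom.xpath('//img[@src]')]
print(image_urls)
```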
Upvotes: 1
Reputation: 3194
requests, scrapy, and BeautifulSoup.
Scrapy is optional, but requests is becoming the unofficial standard, and I haven't seen a better parsing tool than BS.
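To make that concrete, here is a minimal sketch of the requests + BeautifulSoup combination pulling both text and image URLs from a page (the inline HTML is a made-up stand-in for `requests.get(url).text`):

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for requests.get(url).text.
html = """
<html><body>
  <h1>Gallery</h1>
  <p>Two pictures below.</p>
  <img src="a.png"><img src="b.png">
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Text: get_text() flattens all text nodes into one string.
text = soup.get_text(' ', strip=True)

# Pictures: collect every <img> src attribute.
images = [img['src'] for img in soup.find_all('img')]

print(text)
print(images)
```

From there you would download each image URL with requests and write the response body to a file.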
Upvotes: 0