mlo

Reputation: 667

Picture and text scraping with python

Using Python, how would you go about scraping both pictures and text from a website? For example, say I wanted to scrape both the pictures and the text here: what Python tools/libraries would I use? Are there any tutorials?

Upvotes: 1

Views: 1456

Answers (2)

Nikolai Tschacher

Reputation: 1579

Please never use regular expressions for this; they are not made for parsing HTML.
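
To illustrate the point, here is a minimal sketch that compares a naive regex with a real parser on a made-up HTML snippet. The `SAMPLE` markup and the `LinkCollector` class are assumptions for illustration only, and the standard library's `html.parser` is used just to keep the sketch dependency-free; lxml would do the same job.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: three links written in three slightly different ways.
SAMPLE = """
<a href="/first">one</a>
<a class="nav" href='/second'>two</a>
<A HREF="/third">three</A>
"""

# Naive regex: only matches a lowercase tag whose first attribute is a
# double-quoted href, so it silently misses the other two links.
regex_links = re.findall(r'<a href="(.*?)"', SAMPLE)


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, whatever the quoting or attribute order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(SAMPLE)
parser_links = collector.links

print(regex_links)   # only the first link
print(parser_links)  # all three links
```

The regex could be patched for each of these cases, but every patch makes it harder to read and there is always another variation of valid HTML it will not handle.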

Normally I make use of the following combination of tools:

  • requests module
  • lxml.html
  • beautifulsoup4 to detect the website encoding

An approach would look like this, and I hope you get the idea (the code only illustrates the concept; it is not tested and may not work as-is):

import sys

import lxml.html
import requests
from cssselect import HTMLTranslator, SelectorError
from bs4 import UnicodeDammit

# First do the HTTP request with the requests module
r = requests.get('http://example.com')
html = r.content  # raw bytes; a requests response has no read() method

# Try to detect the encoding with beautifulsoup4 and parse the HTML with lxml
try:
    doc = UnicodeDammit(html, is_html=True)
    parser = lxml.html.HTMLParser(encoding=doc.declared_html_encoding)
    dom = lxml.html.document_fromstring(html, parser=parser)
    dom.resolve_base_href()
except Exception as e:
    print('Some error occurred while lxml tried to parse: {}'.format(e))
    sys.exit(1)  # stop here; there is nothing to extract from

# Try to extract all data that we are interested in with CSS selectors
try:
    results = dom.xpath(HTMLTranslator().css_to_xpath('some css selector to target the DOM'))
    for el in results:
        # access elements like
        print(el.get('href'))  # access the href attribute
        print(el.text_content())  # the content as text
        # or process further
        found = el.xpath(HTMLTranslator().css_to_xpath('h3.r > a:first-child'))
except SelectorError as e:
    print('Invalid CSS selector: {}'.format(e))
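
The code above only deals with text and attributes. For the pictures, one possible sketch with the same tools is to collect every `<img src>` with lxml and then fetch each URL with requests. The helper names `image_urls` and `download`, the `images` folder and the 10-second timeout are assumptions made for this example, not part of any library.

```python
import os
from urllib.parse import urlparse

import lxml.html
import requests


def image_urls(html, base_url):
    """Return the absolute URL of every <img src=...> in the document."""
    dom = lxml.html.fromstring(html)
    # Rewrite relative links (e.g. "/a.png") against the page URL.
    dom.make_links_absolute(base_url)
    return [str(src) for src in dom.xpath('//img/@src')]


def download(url, folder='images'):
    """Save one image into folder and return the local path (hypothetical helper)."""
    os.makedirs(folder, exist_ok=True)
    # Use the last path segment as the file name; fall back to a fixed name.
    name = os.path.basename(urlparse(url).path) or 'unnamed'
    path = os.path.join(folder, name)
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # fail loudly on 404 and similar responses
    with open(path, 'wb') as f:
        f.write(r.content)
    return path
```

With a response `r` from `requests.get(...)`, calling `image_urls(r.content, r.url)` and passing each result to `download` would save the pictures locally. Note that two images with the same file name would overwrite each other in this sketch.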

Upvotes: 1

Filip Malczak

Reputation: 3194

requests, Scrapy, and BeautifulSoup.

Scrapy is optional, but requests is becoming the unofficial standard, and I haven't seen a better parsing tool than BS.
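
As a rough sketch of what the BeautifulSoup part could look like, the function below pulls the visible text and the image URLs out of an HTML string. The `extract` helper, the `SAMPLE` markup and the example base URL are made up for illustration; in practice the HTML would come from something like `requests.get(url).text`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract(html, base_url):
    """Return (text, image_urls) for an HTML document (hypothetical helper)."""
    # 'html.parser' is the parser bundled with Python; 'lxml' also works if installed.
    soup = BeautifulSoup(html, 'html.parser')
    # All text nodes, stripped and joined with single spaces.
    text = soup.get_text(separator=' ', strip=True)
    # Only <img> tags that actually carry a src attribute, made absolute.
    images = [urljoin(base_url, img['src']) for img in soup.find_all('img', src=True)]
    return text, images


# Hypothetical page content, used here instead of a live request.
SAMPLE = (
    '<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p>'
    '<img src="pics/cat.png"><img alt="no src"></body></html>'
)

text, images = extract(SAMPLE, 'http://example.com/blog/')
print(text)
print(images)
```

Each URL in `images` can then be fetched with `requests.get(...)` and its `.content` written to a file opened in binary mode.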

Upvotes: 0
