hellowrld

Reputation: 63

Scraping SVG charts

I am trying to scrape the SVG charts from the following link:

https://finance.yahoo.com/quote/AAPL/analysts?p=AAPL

The portion I am trying to scrape is as follows:

Images Here

I do not need the text of the charts (just the graphs themselves). However, I have never scraped an SVG image before and I'm not sure if it is possible. I looked around but could not find any Python packages that do this directly.

I know that I can take a screenshot of the page with Python using Selenium and then use PIL to crop it and save it as an SVG, but I am wondering if there is a more direct way to grab these charts off the page. Any useful packages or implementations would be helpful. Thank you.

Edit: I got some downvotes but I'm not sure why. Here is how I would implement it my way:

import sys
import time
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class Screenshot(QWebView):
    def __init__(self):
        # a QApplication must exist before any widget is created
        self.app = QApplication(sys.argv)
        QWebView.__init__(self)
        self._loaded = False
        self.loadFinished.connect(self._loadFinished)

    def capture(self, url, output_file):
        self.load(QUrl(url))
        self.wait_load()
        # set to webpage size
        frame = self.page().mainFrame()
        self.page().setViewportSize(frame.contentsSize())
        # render image
        image = QImage(self.page().viewportSize(), QImage.Format_ARGB32)
        painter = QPainter(image)
        frame.render(painter)
        painter.end()
        print('saving %s' % output_file)
        image.save(output_file)

    def wait_load(self, delay=0):
        # process app events until page loaded
        while not self._loaded:
            self.app.processEvents()
            time.sleep(delay)
        self._loaded = False

    def _loadFinished(self, result):
        # slot for the loadFinished signal
        self._loaded = True

s = Screenshot()
s.capture('https://finance.yahoo.com/quote/AAPL/analysts?p=AAPL', 'yhf.png')

I would then use the crop function in PIL to cut the charts out of the screenshot.

Upvotes: 2

Views: 4704

Answers (1)

Using QWebView for web scraping seems odd to me, although I realize it has one advantage: it tells the server "I'm not a web scraper, I'm an embedded browser". Note that this approach is not bulletproof: your scraper can still be detected if it behaves in a way that is unusual for a human user.

This is how I would do it:

  1. I'd use requests to download the page (maybe through a proxy that hides your real IP address, to avoid IP bans).
  2. Then I'd parse the page using BeautifulSoup to get the URL of the SVG file you are trying to get.
  3. Then I'd download the SVG file and convert it into an image using something like this
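
Steps 1 and 2 could be sketched roughly as follows. To keep the example free of third-party installs, it swaps in the standard library (urllib and html.parser) for requests and BeautifulSoup; the approach is the same. The names SvgLinkParser, find_svgs and fetch are made up for this illustration. It also assumes the SVG markup is already present in the HTML the server returns; if the charts are drawn by JavaScript in the browser, a plain download will not contain them.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class SvgLinkParser(HTMLParser):
    """Collect URLs of external .svg files referenced by a page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.svg_urls = []

    def handle_starttag(self, tag, attrs):
        # <img src=...>, <embed src=...> and <object data=...> can all point at SVG files
        if tag not in ("img", "embed", "object"):
            return
        for name, value in attrs:
            if name in ("src", "data") and value:
                # ignore any query string when checking the file extension
                if value.split("?")[0].lower().endswith(".svg"):
                    self.svg_urls.append(urljoin(self.base_url, value))


def find_svgs(html, base_url):
    """Return (external SVG URLs, inline <svg>...</svg> strings) found in html."""
    parser = SvgLinkParser(base_url)
    parser.feed(html)
    # simple non-greedy match; it does not handle <svg> elements nested in each other
    inline = re.findall(r"<svg\b.*?</svg>", html, flags=re.DOTALL | re.IGNORECASE)
    return parser.svg_urls, inline


def fetch(url):
    """Download a URL and return its body as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

To use it, call find_svgs(fetch(page_url), page_url) with the page address, write each inline string to its own .svg file, and download any returned URLs with fetch.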

If you want to continue using Qt instead, look for methods in the web view that allow inspecting the DOM or extracting the resources the view downloaded.

Upvotes: 2
