Reputation: 1
I am trying to scrape the x and y coordinates of the shots taken for the match between Everton and Aston Villa from the squawka webpage: http://epl.squawka.com/everton-vs-aston-villa/18-10-2014/english-barclays-premier-league/matches.
I've used the Firebug element inspector to obtain the XPaths for the circles (e.g. /html/body/div[2]/div[3]/div[2]/div[1]/div/div[15]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/svg/g[22]/circle). The pixel coordinates for each shot circle are contained in the cx and cy attributes.
I have tried to scrape these numbers using the scrapy module in Python, but with no success. I am very new to this and have basically adapted the code from the scrapy tutorial. The item file:
import scrapy

class SquawkaItem(scrapy.Item):
    cx = scrapy.Field()
    cy = scrapy.Field()
The spider file:
import scrapy
from squawka.items import SquawkaItem

class SquawkaSpider(scrapy.Spider):
    name = "squawka"
    allowed_domains = ["squawka.com"]
    start_urls = ["http://epl.squawka.com/everton-vs-aston-villa/18-10-2014/english-barclays-premier-league/matches"]

    def parse(self, response):
        for sel in response.xpath('/html/body/div/div/div/div/div/div/div/div/div/div/div/div/svg/g/circle'):
            cx = sel.xpath('[@cx]').extract()
            cy = sel.xpath('[@cy]').extract()
            print cx, cy
When I run this spider in my Linux terminal, using the 'scrapy crawl squawka' command, I get the following output:
2014-10-26 12:49:53+0000 [scrapy] INFO: Scrapy 0.25.0-222-g675fd5b started (bot: squawka)
2014-10-26 12:49:53+0000 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-10-26 12:49:53+0000 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'squawka.spiders', 'SPIDER_MODULES': ['squawka.spiders'], 'BOT_NAME': 'squawka'}
2014-10-26 12:49:54+0000 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, CoreStats, SpiderState
2014-10-26 12:49:55+0000 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-10-26 12:49:55+0000 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-10-26 12:49:55+0000 [scrapy] INFO: Enabled item pipelines:
2014-10-26 12:49:55+0000 [squawka] INFO: Spider opened
2014-10-26 12:49:55+0000 [squawka] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-10-26 12:49:55+0000 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-10-26 12:49:56+0000 [squawka] DEBUG: Crawled (200) <GET http://epl.squawka.com/everton-vs-aston-villa/18-10-2014/english-barclays-premier-league/matches> (referer: None)
2014-10-26 12:49:56+0000 [squawka] INFO: Closing spider (finished)
2014-10-26 12:49:56+0000 [squawka] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 300,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 16169,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 10, 26, 12, 49, 56, 402920),
'log_count/DEBUG': 1,
'log_count/INFO': 3,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 10, 26, 12, 49, 55, 261954)}
2014-10-26 12:49:56+0000 [squawka] INFO: Spider closed (finished)
As you can see, the log says it crawled 0 pages, and there is no output data. I have no idea how to change my code to get the data I want. Any suggestions for changes to my code, or other techniques I could use, would be gratefully received. Thanks.
Upvotes: 0
Views: 1374
Reputation: 1
I've got the data I need now; thanks for the suggestions. I ended up using Selenium WebDriver and BeautifulSoup. Here's my code:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("http://epl.squawka.com/everton-vs-aston-villa/18-10-2014/english-barclays-premier-league/matches")

# Find and click the elements which bring up the shots SVG image
inputElement = driver.find_element_by_id("mc-stat-shot")
inputElement.click()
inputElement = driver.find_element_by_id("team2-select")
inputElement.click()

# Parse the page source as rendered after the clicks
pageSource = driver.page_source
soup = BeautifulSoup(pageSource)

# Print the coordinates of every circle whose radius is 6.5
for circle in soup.find_all('circle'):
    if circle['r'] == '6.5':
        x = circle['cx']
        y = circle['cy']
        print x, y
    else:
        continue

driver.quit()
The code uses Selenium to click the elements on the webpage that bring up the correct SVG image (the whole pitch with all shots marked as circles on it). The pitch SVG has dimensions of 480x366 pixels. I then store the page source and get the attribute values I'm interested in (the x and y coordinates of the circles) using BeautifulSoup.
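In case it helps anyone doing the same thing, here is a rough sketch of how the pixel values could be turned into fractions of the pitch. It is written for Python 3, and the names normalise_shot and shots_from_circles are made up for illustration. It assumes the pitch fills the whole 480x366 SVG and that the origin is the top-left corner (the usual SVG convention); I haven't checked either against the actual page.

```python
# Pitch SVG size in pixels, taken from the post above.
PITCH_WIDTH_PX = 480.0
PITCH_HEIGHT_PX = 366.0


def normalise_shot(cx, cy, width=PITCH_WIDTH_PX, height=PITCH_HEIGHT_PX):
    """Convert pixel coordinates (strings or numbers) to 0..1 fractions.

    Assumption: the pitch covers the full SVG, so dividing by the SVG
    size gives the position as a fraction of pitch length and width.
    """
    return float(cx) / width, float(cy) / height


def shots_from_circles(circles, shot_radius="6.5"):
    """Keep circles whose 'r' attribute equals shot_radius and normalise them.

    `circles` can be any iterable of dict-like objects exposing 'r', 'cx'
    and 'cy' (plain dicts here; BeautifulSoup tags support the same access).
    """
    return [
        normalise_shot(c["cx"], c["cy"])
        for c in circles
        if c.get("r") == shot_radius
    ]


# Made-up attribute values, only to show the shape of the data.
sample_circles = [
    {"r": "6.5", "cx": "240", "cy": "183"},
    {"r": "3", "cx": "10", "cy": "10"},
    {"r": "6.5", "cx": "480", "cy": "0"},
]

for x, y in shots_from_circles(sample_circles):
    print(x, y)
```

With the Selenium code above, the result of soup.find_all('circle') could be passed straight to shots_from_circles instead of the sample list.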
Upvotes: 0
Reputation: 20748
It seems to me that the SVG graphic is drawn by JavaScript, and is not present in the source HTML.
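A quick way to check this is to count the circle tags in the raw HTML the server returns and compare that with the page as rendered in a browser. Below is a minimal sketch using only the Python 3 standard library. CircleCounter and count_circles are names I made up, and the two HTML strings are invented stand-ins, not the real Squawka markup.

```python
from html.parser import HTMLParser


class CircleCounter(HTMLParser):
    """Counts <circle> start tags seen while parsing markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <circle .../> also end up here,
        # because the default handle_startendtag calls this method.
        if tag == "circle":
            self.count += 1


def count_circles(markup):
    """Return the number of <circle> elements found in an HTML string."""
    parser = CircleCounter()
    parser.feed(markup)
    parser.close()
    return parser.count


# Invented stand-in for HTML as downloaded: a placeholder div plus a script.
static_html = (
    "<html><body><div id='pitch'></div>"
    "<script>drawShots();</script></body></html>"
)

# Invented stand-in for the DOM after the script has drawn the SVG.
rendered_html = (
    "<html><body><div id='pitch'><svg><g>"
    "<circle cx='10' cy='20' r='6.5'/>"
    "</g></svg></div></body></html>"
)

print(count_circles(static_html))
print(count_circles(rendered_html))
```

If count_circles returns 0 for the body your spider downloaded (response.body, decoded to text), no XPath will find the circles in that response.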
You'll need either:
Upvotes: 0