Reputation: 4152
I am using Python.org version 2.7 64 bit on Vista 64 bit. My current Scrapy code works pretty well for extracting text, but I'm a bit stuck on how to get data from tables on websites. I've had a look online for answers but I'm still not sure. As an example, I would like to get the data contained in this table of Wayne Rooney's goalscoring stats:
http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney
The code I currently have is this:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.utils.markup import remove_tags
from scrapy.cmdline import execute
import re

class MySpider(Spider):
    # Spider names are case-sensitive and must match the name passed to "crawl" below.
    name = "goals"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney"]

    def parse(self, response):
        titles = response.selector.xpath("normalize-space(//title)")
        for title in titles:
            # Collect every paragraph, strip the markup and print the text.
            body = response.xpath("//p").extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')

execute(['scrapy','crawl','goals'])
What syntax do I need to use in the xpath() statements to get tabular data?
Thanks
Upvotes: 1
Views: 7865
Reputation: 774
First of all, for each symbol you want, you need to know the name associated with it. For example, goals are marked by a <span> element whose title attribute equals "Goal", and assists by a <span> element whose title attribute equals "Assist".
With this information, you can check whether each retrieved row contains a span whose title matches the symbol you want.
To get all the goal symbols of a row, evaluate the row with the relative expression .//span[@title="Goal"] (the leading dot keeps the search inside that row instead of the whole page), as below:
for row in response.selector.xpath(
        '//table[@id="player-fixture"]//tr[td[@class="tournament"]]'):
    # Does this row contain goal symbols?
    list_of_goals = row.xpath('.//span[@title="Goal"]')
    if list_of_goals:
        # Output goals text.
        .
        .
        .
If the returned list is not empty, the row contains goal symbols. The number of goals in that row is the length of the returned list of spans, so you can output that many goal texts.
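The per-row count described above can be sketched as follows. This is only an illustration under assumptions: the sample markup is made up (the real page may be structured differently), and to stay runnable without Scrapy it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath. In a spider, the same relative expression would be passed to row.xpath().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE = """
<table id="player-fixture">
  <tr>
    <td class="tournament">EPL</td>
    <td class="incidents"><span title="Goal"/><span title="Goal"/><span title="Assist"/></td>
  </tr>
  <tr>
    <td class="tournament">UCL</td>
    <td class="incidents"><span title="Assist"/></td>
  </tr>
</table>
"""


def count_symbols(row, title):
    """Count the <span> elements with the given title inside one row only."""
    # The leading "." keeps the search relative to this row.
    return len(row.findall(".//span[@title='%s']" % title))


table = ET.fromstring(SAMPLE.strip())
goals_per_row = [count_symbols(row, "Goal") for row in table.findall("tr")]
print(goals_per_row)
```

The helper name count_symbols is hypothetical; the point is that the count comes from the length of the list returned by a row-relative query.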
Upvotes: 0
Reputation: 700
To scrape data, you usually identify the table, then loop over its rows. An HTML table like this one usually has this format:
<table id="thistable">
  <tr>
    <th>Header1</th>
    <th>Header2</th>
  </tr>
  <tr>
    <td>data1</td>
    <td>data2</td>
  </tr>
</table>
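As a rough sketch of the "identify the table, then loop over the rows" idea, the snippet below walks the generic table above and pairs each data cell with its header. It assumes well-formed markup and uses the standard library's xml.etree.ElementTree so that it runs without Scrapy; real pages are often not valid XML, which is one reason to use Scrapy's selectors instead. The function name table_to_records is made up for this example.

```python
import xml.etree.ElementTree as ET

TABLE_HTML = """
<table id="thistable">
  <tr><th>Header1</th><th>Header2</th></tr>
  <tr><td>data1</td><td>data2</td></tr>
</table>
"""


def table_to_records(markup):
    """Return one dict per data row, keyed by the header cells."""
    table = ET.fromstring(markup.strip())
    rows = table.findall("tr")
    # The first row holds the <th> header cells.
    headers = [th.text for th in rows[0].findall("th")]
    records = []
    for row in rows[1:]:
        cells = [td.text for td in row.findall("td")]
        records.append(dict(zip(headers, cells)))
    return records


records = table_to_records(TABLE_HTML)
print(records)
```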
Here's an example of how to parse this fixture table:
from scrapy.spider import Spider
from scrapy.http import Request
from myproject.items import Fixture

class GoalSpider(Spider):
    name = "goal"
    allowed_domains = ["whoscored.com"]
    start_urls = (
        'http://www.whoscored.com/',
    )

    def parse(self, response):
        return Request(
            url="http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney",
            callback=self.parse_fixtures
        )

    def parse_fixtures(self,response):
        sel = response.selector
        for tr in sel.css("table#player-fixture>tbody>tr"):
            item = Fixture()
            item['tournament'] = tr.xpath('td[@class="tournament"]/span/a/text()').extract()
            item['date'] = tr.xpath('td[@class="date"]/text()').extract()
            item['team_home'] = tr.xpath('td[@class="team home "]/a/text()').extract()
            yield item
First, I identify the data rows with sel.css("table#player-fixture>tbody>tr") and loop over the results, then extract the data from each row.
Edit: items.py (http://doc.scrapy.org/en/latest/topics/items.html)
from scrapy.item import Item, Field

class Fixture(Item):
    tournament = Field()
    date = Field()
    team_home = Field()
Upvotes: 0
Reputation: 774
I just looked at the page, and I got all the rows of the tournament table you want with this XPath expression: '//table[@id="player-fixture"]//tr[td[@class="tournament"]]'.
I'll try to explain each part of this XPath expression:
//table[@id="player-fixture"]: retrieves the whole table whose id attribute is player-fixture, as you can see by inspecting the page.
//tr[td[@class="tournament"]]: retrieves all rows that hold the information for each match you want.
You can also use just the shorter expression //tr[td[@class="tournament"]]. But I think it is more consistent to use the full expression, because it states that you want all rows (tr) under a specific table whose id (player-fixture) is unique.
Once you get all rows, you can loop over them to get all information you need from each row entry.
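To illustrate the row filter and the loop, here is a small sketch over made-up sample markup (the tournament names and dates are placeholders, not data from the page). It uses the standard library's xml.etree.ElementTree so it runs without Scrapy; as far as I know that module does not support the nested predicate tr[td[@class="tournament"]], so the filter is written in Python instead. In Scrapy the full expression can be passed to response.xpath() directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; values are placeholders.
SAMPLE = """
<table id="player-fixture">
  <tr><th>Tournament</th><th>Date</th></tr>
  <tr><td class="tournament">EPL</td><td class="date">2014-01-01</td></tr>
  <tr><td class="separator">-</td></tr>
  <tr><td class="tournament">UCL</td><td class="date">2014-01-08</td></tr>
</table>
"""


def match_rows(table):
    """Yield only the rows that contain a td with class "tournament"."""
    for row in table.iter("tr"):
        if row.find("td[@class='tournament']") is not None:
            yield row


table = ET.fromstring(SAMPLE.strip())
matches = [
    {
        "tournament": row.find("td[@class='tournament']").text,
        "date": row.find("td[@class='date']").text,
    }
    for row in match_rows(table)
]
print(matches)
```

The header row and the separator row are skipped because neither contains a td with class tournament, which is the same effect the predicate has in the full XPath expression.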
Upvotes: 3