Reputation: 392
Here is the table I am trying to get data from:
<table class="statBlock" cellspacing="0">
<tr>
<th>
<a href="/srd/magicOverview/spellDescriptions.htm#level">Level</a>:
</th>
<td>
<a href="/srd/spellLists/clericSpells.htm#thirdLevelClericSpells">Clr 3</a>
</td>
</tr>
<tr>
<th>
<a href="/srd/magicOverview/spellDescriptions.htm#components">Components</a>:
</th>
<td>
V, S
</td>
</tr>
<tr>
<th>
<a href="/srd/magicOverview/spellDescriptions.htm#castingTime">Casting Time</a>:
</th>
<td>
1 <a href="/srd/combat/actionsInCombat.htm#standardActions">standard action</a>
</td>
</tr>
ETC...
This is the Scrapy code I have so far for parsing it:
for sel in response.xpath('//tr'):
    string = " ".join(response.xpath('//th/a/text()').extract()) + ":" + " ".join(response.xpath('//td/text()').extract())
    print string
But this yields a result like this:
Level Components Casting Time Range Effect Duration Saving Throw Spell Resistance:V, S, M, XP 12 hours 0 ft. One duplicate creature Instantaneous None No
When the output should look something like
Level: CLR 1 Components:V, S, M etc...
Essentially, instead of looping through each row of the table, finding that row's <th> and <td> cells, and sticking them together, it finds all of the data from every <th> and all of the data from every <td>, and then sticks those two sets together. I assume my for statement needs to be fixed. How do I get it to examine each row individually?
Upvotes: 1
Views: 772
Reputation: 90999
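When you query an XPath like
response.xpath('//th/a/text()')
the expression is evaluated against the whole response, so it returns the text of every <a> element inside every <th> element in the document, not just the ones in the current row. That is not what you want. Query the loop variable with a relative XPath instead: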
for sel in response.xpath('//tr'):
    string = " ".join(sel.xpath('.//th/a/text()').extract()) + ":" + " ".join(sel.xpath('.//td/text()').extract())
    print string
The leading dot makes the XPath relative to the current node (sel, the <tr> being processed). Without it, an expression starting with // searches from the document root, even when it is called on sel.
More details are in the "Working with relative XPaths" section of the Scrapy selectors documentation.
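To see the per-row idea in isolation, here is a minimal sketch that runs without Scrapy. It is not the code from the question: it swaps in the standard library's xml.etree.ElementTree (where every path is relative to the element it is called on) and uses a trimmed copy of the table above. The names HTML, clean and parse_rows are made up for this illustration. It also collects text from nested tags, because './/td/text()' only returns text sitting directly inside the <td>, so link text such as "Clr 3" would be skipped; in Scrapy the equivalent would be './/td//text()'.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table from the question. It is well-formed,
# so the XML parser accepts it (real-world HTML often is not).
HTML = """
<table class="statBlock" cellspacing="0">
<tr>
<th><a href="/srd/magicOverview/spellDescriptions.htm#level">Level</a>:</th>
<td><a href="/srd/spellLists/clericSpells.htm#thirdLevelClericSpells">Clr 3</a></td>
</tr>
<tr>
<th><a href="/srd/magicOverview/spellDescriptions.htm#components">Components</a>:</th>
<td>
V, S
</td>
</tr>
<tr>
<th><a href="/srd/magicOverview/spellDescriptions.htm#castingTime">Casting Time</a>:</th>
<td>
1 <a href="/srd/combat/actionsInCombat.htm#standardActions">standard action</a>
</td>
</tr>
</table>
"""


def clean(element):
    """Join all text inside an element (nested tags included) and collapse whitespace."""
    return " ".join("".join(element.itertext()).split())


def parse_rows(markup):
    """Return one 'Header: value' string per table row."""
    table = ET.fromstring(markup.strip())
    rows = []
    for row in table.findall(".//tr"):
        # Both lookups start from `row`, so they only see this row's cells.
        header = row.find("./th/a")
        cell = row.find("./td")
        if header is None or cell is None:
            continue  # skip rows that do not have both a header link and a cell
        rows.append(clean(header) + ": " + clean(cell))
    return rows


rows = parse_rows(HTML)
for line in rows:
    print(line)
```

The important part is that every lookup inside the loop is called on row rather than on the whole document, which is the same role sel plays in the Scrapy version.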
Upvotes: 2