Reputation: 551
I'm trying to learn Scrapy, and right now I'm trying to extract information from an etymology website: http://www.etymonline.com. For now, I just want to get the words and their raw descriptions. This is how a typical chunk of HTML looks on etymonline:
<dt>
<a href="/index.php?term=address&allowed_in_frame=0">address (n.)</a>
<a href="http://dictionary.reference.com/search?q=address" class="dictionary" title="Look up address at Dictionary.com">
<img src="graphics/dictionary.gif" width="16" height="16" alt="Look up address at Dictionary.com" title="Look up address at Dictionary.com"/>
</a>
</dt>
<dd>
1530s, "dutiful or courteous approach," from <a href="/index.php?term=address&allowed_in_frame=0" class="crossreference">address</a> (v.) and from French <span class="foreign">adresse</span>. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).
</dd>
The word is contained in the <dt> tag, and the description in its next sibling, the <dd> tag.
To get the list of words on a page like http://www.etymonline.com/index.php?l=a&p=9&allowed_in_frame=0, it's possible to write:
word = sel.xpath('//dl/dt/a/text()').extract()
Then I tried to loop over this list of words and extract the relevant information using this line of code:
info = selInfo.xpath("//dl/dt[a='"+word[i]+"']/following-sibling::dd")
But it doesn't seem to work. Any ideas?
Upvotes: 8
Views: 10669
Reputation: 5814
A solution using following-sibling.
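import scrapy


class SingleSpider(scrapy.Spider):
    name = "etym"
    allowed_domains = ["etymonline.com"]
    start_urls = [
        "http://www.etymonline.com/index.php?l=d&allowed_in_frame=0"]

    def parse(self, response):
        # Each <dl> holds alternating <dt> (word) and <dd> (description) nodes
        for nodes in response.xpath('//dl'):
            for i in nodes.xpath('dt'):
                # The word is the text of the link inside the <dt>
                print(i.xpath('a/text()').extract())
                # [1] keeps only the <dd> immediately after this <dt>
                print(i.xpath('following-sibling::dd[1]/text()').extract())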
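Basically, it loops over every dt inside each dl and, for each one, prints the text of its link and the text nodes of the first dd that follows it.
Here is an extract of the output: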
[u'daiquiri (n.)'] [u'type of alcoholic drink, 1920 (first recorded in F. Scott Fitzgerald), from ', u', name of a district or village in eastern Cuba.']
[u'dairy (n.)'] [u'late 13c., "building for making butter and cheese; dairy farm," formed with Anglo-French ', u' affixed to Middle English ', u' (in ', u' "dairymaid"), from Old English ', u' "kneader of bread, housekeeper, female servant" (see ', u' (n.1)). The purely native word was ', u'.']
[u'dais (n.)'] [u'mid-13c., from Anglo-French ', u', Old French ', u' "table, platform," from Latin ', u' "disk-shaped object," also, by medieval times, "table," from Greek ', u' "quoit, disk, dish" (see ', u' (n.)). Died out in English c.1600, preserved in Scotland, revived 19c. by antiquarians.']
Upvotes: 1
Reputation: 20748
To get to the <dd> after a <dt>, you can use the following-sibling axis; you're correct.
following-sibling::dd will select all dd elements after the context node. Therefore you need to restrict the XPath to only the first one, using the position predicate [1].
For each dt element you get out of //dl/dt, you select following-sibling::dd[1].
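To make the effect of the [1] predicate concrete, here is a minimal sketch using lxml rather than Scrapy (the same XPath expressions apply to Scrapy selectors). The two-entry glossary in SAMPLE is made up for illustration and is not taken from etymonline.

```python
from lxml import html

# Hypothetical two-entry glossary, shaped like the markup in the question.
SAMPLE = """
<dl>
  <dt><a href="#">alpha (n.)</a></dt>
  <dd>first definition</dd>
  <dt><a href="#">beta (n.)</a></dt>
  <dd>second definition</dd>
</dl>
""".strip()

doc = html.fromstring(SAMPLE)
first_dt = doc.xpath('//dl/dt')[0]

# Without a predicate, every later <dd> sibling is selected.
all_following = first_dt.xpath('following-sibling::dd/text()')

# With [1], only the nearest following <dd> is kept.
only_next = first_dt.xpath('following-sibling::dd[1]/text()')

print(all_following)  # ['first definition', 'second definition']
print(only_next)      # ['first definition']
```

On a real page with dozens of entries, the unrestricted version would return every description below the current word, which is why the predicate matters.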
Here's a sample session using scrapy shell for the term "address":
$ scrapy shell "http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none"
...
2014-11-26 10:34:53+0100 [default] DEBUG: Crawled (200) <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f1396cc6950>
[s] item {}
[s] request <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s] response <200 http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s] settings <scrapy.settings.Settings object at 0x7f1397399bd0>
[s] spider <Spider 'default' at 0x7f13966c05d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: for dt in response.xpath('//dl/dt'):
   ...:     print "Word:", dt.xpath('string(a)').extract()
   ...:     print "Definition:", dt.xpath('string(following-sibling::dd[1])').extract()
   ...:     print
   ...:
Word: [u'address (n.)']
Definition: [u'1530s, "dutiful or courteous approach," from address (v.) and from French adresse. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).']
Word: [u'addressee (n.)']
Definition: [u'1810; see address (v.) + -ee.']
Word: [u'address (v.)']
Definition: [u'early 14c., "to guide or direct," from Old French adrecier "go straight toward; straighten, set right; point, direct" (13c.), from Vulgar Latin *addirectiare "make straight," from Latin ad "to" (see ad-) + *directiare, from Latin directus "straight, direct" (see direct (v.)). Late 14c. as "to set in order, repair, correct." Meaning "to write as a destination on a written message" is from mid-15c. Meaning "to direct spoken words (to someone)" is from late 15c. Related: Addressed; addressing.']
Word: [u'salutatorian (n.)']
Definition: [u'1841, American English, from salutatory "of the nature of a salutation," here in the specific sense "designating the welcoming address given at a college commencement" (1702) + -ian. The address was originally usually in Latin and given by the second-ranking graduating student.']
...
Word: [u'reverend (adj.)']
Definition: [u'early 15c., "worthy of respect," from Middle French reverend, from Latin reverendus "(he who is) to be respected," gerundive of revereri (see reverence). As a form of address for clergymen, it is attested from late 15c.; earlier reverent (late 14c. in this sense). Abbreviation Rev. is attested from 1721, earlier Revd. (1690s). Very Reverend is used of deans, Right Reverend of bishops, Most Reverend of archbishops.']
Word: [u'nun (n.)']
Definition: [u'Old English nunne "nun, vestal, pagan priestess, woman devoted to religious life under vows," from Late Latin nonna "nun, tutor," originally (along with masc. nonnus) a term of address to elderly persons, perhaps from children\'s speech, reminiscent of nana (compare Sanskrit nona, Persian nana "mother," Greek nanna "aunt," Serbo-Croatian nena "mother," Italian nonna, Welsh nain "grandmother;" see nanny).']
In [2]:
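The session above uses string(...) instead of text(), which is why each definition comes back as one piece even though the dd contains nested a and span elements. A small sketch of the difference, again with lxml and a shortened, hand-written version of the "address" entry (the markup in SAMPLE is an assumption for illustration):

```python
from lxml import html

# Shortened, hand-written variant of the "address" entry from the question.
SAMPLE = (
    '<dl>'
    '<dt><a href="#">address (n.)</a></dt>'
    '<dd>from <a href="#">address</a> (v.) and from French '
    '<span class="foreign">adresse</span>.</dd>'
    '</dl>'
)

dd = html.fromstring(SAMPLE).xpath('//dd')[0]

# text() returns only the text nodes that are direct children of <dd>,
# so the words inside <a> and <span> are missing.
fragments = dd.xpath('text()')

# string(.) concatenates all descendant text in document order.
full_text = dd.xpath('string(.)')

print(fragments)  # ['from ', ' (v.) and from French ', '.']
print(full_text)  # from address (v.) and from French adresse.
```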
Upvotes: 5
Reputation: 20583
For your XPath to work, the idea is not to loop over the extracted list of words, but to loop over the parent nodes and query relative to each one.
I don't have Scrapy on my Mac at the moment, but the same technique should apply, like this:
# I use lxml for loose html string parsing
from lxml import html
s = '''<dt><a href="/index.php?term=address&allowed_in_frame=0">address (n.)</a> <a href="http://dictionary.reference.com/search?q=address" class="dictionary" title="Look up address at Dictionary.com"><img src="graphics/dictionary.gif" width="16" height="16" alt="Look up address at Dictionary.com" title="Look up address at Dictionary.com" /></a></dt>
<dd>1530s, "dutiful or courteous approach," from <a href="/index.php?term=address&allowed_in_frame=0" class="crossreference">address</a> (v.) and from French <span class="foreign">adresse</span>. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).</dd>'''
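sel = html.fromstring(s)
# rather than extracting the words straight away, loop over the parent nodes
for nodes in sel.xpath('//dt'):
    # then access the a node to get the text
    print(nodes.xpath('a/text()'))
    # and go back to the parent and search the dd node
    # (note: '../dd' matches every dd under the parent, which is fine for this
    # single-entry sample; on a full page use 'following-sibling::dd[1]')
    print(nodes.xpath('../dd/text()'))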
# sample results
['address (n.)']
['1530s, "dutiful or courteous approach," from ', ' (v.) and from French ', '. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).']
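As a rough illustration of why the query should be relative to the node you are looping over, the sketch below pairs each word with its definition on a made-up two-entry list. An expression starting with // is evaluated from the document root no matter which node it is called on, so it does not follow the loop variable. This is only meant to show the general pitfall, not to diagnose the exact failure in the question.

```python
from lxml import html

# Hypothetical two-entry glossary, not real etymonline content.
SAMPLE = """
<dl>
  <dt><a href="#">alpha (n.)</a></dt>
  <dd>first definition</dd>
  <dt><a href="#">beta (n.)</a></dt>
  <dd>second definition</dd>
</dl>
""".strip()

doc = html.fromstring(SAMPLE)

absolute = {}
relative = {}
for dt in doc.xpath('//dl/dt'):
    word = str(dt.xpath('string(a)'))
    # '//dd' starts at the document root, so it always finds the first <dd>.
    absolute[word] = str(dt.xpath('string(//dd)'))
    # A relative path starts at the current <dt>.
    relative[word] = str(dt.xpath('string(following-sibling::dd[1])'))

print(absolute)  # both words map to 'first definition'
print(relative)  # each word maps to its own definition
```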
Hope this helps.
Upvotes: 2