Reputation: 1409
I'm building a crawler for a webpage that, for some reason, holds the ID numbers of the items I'm extracting in meta tags, like this:
<meta content="1001662613">
where the number in quotation marks is the number I want.
I tried using the xpath
Id = title.select('//meta [@content]').extract()
But results for that come out empty. Using
Id = title.select('//meta/@content').extract()
in turn gives me the entire page's source code after the meta tag...
Is there any way to extract the number from the tag itself, instead of trying to go into the tag (which is empty)?
For reference, here's an example of the section of the page's source where the ID number is located
<link rel="stylesheet" type="text/css" href="/ccss/2076d1c6bea75c5b6f4c753b3b4920b6_14bfe2d5b91d791bc05282634acdfb68.css" />
<script type="text/javascript" src="/cjs/986570aebf4e6cef6e0a52faa9c5a8a2_f4ceae6565fa007f39ee4e0abe02ab7b.js"></script>
<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.8.16/jquery-ui.min.js"></script>
<script type="text/javascript" src="/cjs/a373b58f85b5e68c60f3edc35b348e14_a2abaa7837c3e1ccda94d6fe6b0f7a8f.js"></script>
<meta content="1001657519"/>
<link href="http://www.groupon.com.uy/descuentos/montevideo/sushi-go-26-12-7" rel="canonical" />
<link href="http://www.groupon.com.uy/deals/feed.rss" type="application/rss+xml" rel="alternate" title="Groupon - Descuentos" />
<meta name="title" content="Desde $264 en vez de $462 por 24, 48 o 72 piezas de sushi en Sushi Go"/>
Upvotes: 3
Views: 6181
Reputation: 474191
//meta/@content returns multiple results because there are multiple meta tags on the page. Just filter for the values that consist only of digits:
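# //meta/@content yields one string per meta tag that has a content attribute
ids = title.select('//meta/@content').extract()
# keep only the values made up entirely of digits
print([value for value in ids if value.isdigit()])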
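To check the filtering idea without running a crawler, here is a minimal sketch that uses only Python's standard-library html.parser instead of the Scrapy selector from the question. The names MetaContentCollector, extract_numeric_ids and SAMPLE are made up for illustration, and SAMPLE is a shortened copy of the page source quoted in the question.

```python
from html.parser import HTMLParser


class MetaContentCollector(HTMLParser):
    """Hypothetical helper: collects the content attribute of every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags such as
        # <meta .../> also end up here via the default handle_startendtag.
        if tag == "meta":
            content = dict(attrs).get("content")
            if content is not None:
                self.contents.append(content)


def extract_numeric_ids(html):
    """Return the meta content values that consist only of digits."""
    collector = MetaContentCollector()
    collector.feed(html)
    return [value for value in collector.contents if value.isdigit()]


# Shortened copy of the snippet from the question (an assumption about the page layout).
SAMPLE = """
<meta content="1001657519"/>
<link href="http://www.groupon.com.uy/deals/feed.rss" type="application/rss+xml" rel="alternate" title="Groupon - Descuentos" />
<meta name="title" content="Desde $264 en vez de $462 por 24, 48 o 72 piezas de sushi en Sushi Go"/>
"""

if __name__ == "__main__":
    print(extract_numeric_ids(SAMPLE))
```

The title meta tag is dropped because its content contains non-digit characters, so only the ID value is left. If the page had other meta tags with purely numeric content, this filter would return those as well.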
Hope that helps.
Upvotes: 2