Reputation: 2047
Sample html:
<div id="foobar" foo="hello;world;bar;baz">blablabla</div>
I'm using LinkExtractor to get the attribute foo, namely the string hello;world;bar;baz. I wonder if it's possible to turn this string into multiple URLs for the spider to follow, like hello.com, world.com, etc.
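To make it concrete, this is roughly the transformation I'm after (the https://....com shape is just an example):

value = 'hello;world;bar;baz'
print(['https://{}.com'.format(part) for part in value.split(';')])
# ['https://hello.com', 'https://world.com', 'https://bar.com', 'https://baz.com']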
Any help is appreciated.
PS: the following might (or might not) be useful:
- the process_value argument of LxmlLinkExtractor
- the process_links argument of Rule
Upvotes: 2
Views: 733
Reputation: 231
This will work for you:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

def url_break(value):
    # split the ';'-separated attribute value into individual urls
    for url in value.split(';'):
        yield url

class MySpider(CrawlSpider):
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths=YOUR_XPATH_LIST,  # your own XPaths
                               process_value=url_break)),
    ]
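For reference, url_break just splits the attribute value into its parts (a quick sanity check, runnable on its own):

print(list(url_break('hello;world;bar;baz')))
# ['hello', 'world', 'bar', 'baz']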
Upvotes: 0
Reputation: 474171
The problem is that, if you are using the built-in LinkExtractor, the process_value callable has to return a single link - it would fail here since, in your case, it returns a list of links.
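To see why, here is a stripped-down sketch of what the built-in extractor does with each attribute value (a simplification for illustration, not the actual Scrapy source; stock_handling is a made-up name):

from urlparse import urljoin  # Python 2, matching the code below

def stock_handling(process_value, response_url, attr_val):
    url = process_value(attr_val)      # your callable is invoked once per attribute
    if url is None:
        return None                    # None means "drop this link"
    return urljoin(response_url, url)  # urljoin() expects a single string, not a list

print(stock_handling(lambda value: value.split(';')[0],
                     'http://example.com', 'hello;world;bar;baz'))
# http://example.com/hello - only one link can come out per attribute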
You would need a custom parser link extractor that supports extracting multiple links per attribute, something like this (not tested):
from urlparse import urljoin  # Python 2, matching the Scrapy version this targets

from scrapy.link import Link
from scrapy.linkextractors.lxmlhtml import (LxmlParserLinkExtractor,
                                            _collect_string_content)
from scrapy.utils.misc import rel_has_nofollow
from scrapy.utils.python import unique as unique_list


class MyParserLinkExtractor(LxmlParserLinkExtractor):

    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        # hacky way to get the underlying lxml parsed document
        for el, attr, attr_val in self._iter_links(selector.root):
            # pseudo lxml.html.HtmlElement.make_links_absolute(base_url)
            try:
                attr_val = urljoin(base_url, attr_val)
            except ValueError:
                continue  # skipping bogus links
            else:
                urls = self.process_attr(attr_val)
                if urls is None:
                    continue
            # urls is a list here - process_value returns one url per part
            for item in urls:
                if isinstance(item, unicode):
                    item = item.encode(response_encoding)
                item = urljoin(response_url, item)
                link = Link(item, _collect_string_content(el) or u'',
                            nofollow=rel_has_nofollow(el.get('rel')))
                links.append(link)
        return unique_list(links, key=lambda link: link.url) \
            if self.unique else links
Then, based on it, define your actual Link Extractor:
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.utils.misc import arg_to_iter


class MyLinkExtractor(LxmlLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs
        lx = MyParserLinkExtractor(tag=tag_func, attr=attr_func,
                                   unique=unique, process=process_value)
        # calling super(LxmlLinkExtractor, ...) on purpose - this skips
        # LxmlLinkExtractor.__init__ so that our lx is used as the parser
        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)
You would then need to have tags, attrs and process_value defined:

MyLinkExtractor(tags=["div"], attrs=["foo"], process_value=extract_links)
where extract_links is defined as:

def extract_links(value):
    return ["https://{}.com".format(part) for part in value.split(";")]
Upvotes: 2