Reputation: 10580
I am learning Scrapy. I am using Scrapy 0.20, which is why I am following this tutorial: http://doc.scrapy.org/en/0.20/intro/tutorial.html
I understood the concepts. However, one thing is still unclear to me.
In this statement
sel.xpath('//title/text()').re('(\w+):')
the output is
[u'Computers', u'Programming', u'Languages', u'Python']
What is re('(\w+):') used for?
this statement
sel.xpath('//title/text()').extract()
has this output:
[u'Open Directory - Computers: Programming: Languages: Python: Books']
Why is a comma added between the elements in the first output?
Also, all the ':' characters are removed.
Moreover, is this pure Python syntax?
Upvotes: 1
Views: 1770
Reputation: 8893
This is a regular expression (regex), and is a whole world unto itself.
(\w+): matches one or more word characters followed by a colon, but captures (and so returns) only the word, not the colon. That is why the ":" is removed.
(\w+:) matches the same text, but the colon is inside the capturing group, so the ":" is returned along with the word.
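A minimal sketch of the difference between the two patterns, using Python's standard re module on the title string from the question. Here re.findall stands in for Scrapy's .re() method, on the assumption that both return the captured groups; the variable names are illustrative only.

```python
import re

# Title text as extracted in the question.
title = 'Open Directory - Computers: Programming: Languages: Python: Books'

# Colon outside the group: a colon must follow the word,
# but only the word itself is returned.
without_colon = re.findall(r'(\w+):', title)

# Colon inside the group: the colon is returned together with the word.
with_colon = re.findall(r'(\w+:)', title)

print(without_colon)
print(with_colon)
```

Neither "Directory" nor "Books" appears in either result, because no colon follows them.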
Also, if you want to learn about regex, Codecademy has a good Python course.
Upvotes: 2
Reputation: 239563
(\w+): is a regular expression that matches any word immediately followed by : and captures the word characters ([a-zA-Z0-9_]) in a group.
The output does not contain the :, because the re() method returns only the captured groups, and the : sits outside the parentheses.
The results are returned as a Python list. When a list is displayed as a string, its elements are shown separated by commas; the commas are not part of the data.
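To see that the commas belong to the list's printed form and not to the matched text, here is a small sketch. The list literal simply copies the output from the question, and the variable names are illustrative.

```python
# The four matches from the question, held in an ordinary Python list.
words = ['Computers', 'Programming', 'Languages', 'Python']

# Printing the list shows brackets, quotes, and commas added by Python.
as_printed = str(words)
print(as_printed)

# Joining the elements yourself shows that no element contains a comma.
joined = ' '.join(words)
print(joined)
```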
\w is shorthand for [a-zA-Z0-9_].
Quoting from the Python regular expressions documentation:
\w
When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
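A short sketch of what \w accepts. Note the assumption: this runs on Python 3, where str patterns use Unicode matching by default and the re.ASCII flag restricts \w to [a-zA-Z0-9_], whereas the passage quoted above describes the Python 2 flags. The sample strings are made up for illustration.

```python
import re

# Digits and underscores count as word characters, so they stay in the match.
mixed = re.findall(r'\w+', 'py_3k: v2')

# 'caf\u00e9' is "cafe" with an accented final letter.
accented = 'caf\u00e9'

# Python 3 default for str patterns: Unicode letters match \w.
unicode_match = re.findall(r'\w+', accented)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the accented letter is excluded.
ascii_match = re.findall(r'\w+', accented, re.ASCII)

print(mixed, unicode_match, ascii_match)
```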
Upvotes: 1