Marco Dinatsoli

Reputation: 10580

scrapy python re statement

I am learning Scrapy. I am using Scrapy 0.20, so I am following the tutorial for that version: http://doc.scrapy.org/en/0.20/intro/tutorial.html

I understood the concepts. However, there is one thing I still don't get.

In this statement

sel.xpath('//title/text()').re('(\w+):')

the output is

[u'Computers', u'Programming', u'Languages', u'Python']

What is re('(\w+):') used for?

To help with answering:

this statement

sel.xpath('//title/text()').extract()

has this output:

[u'Open Directory - Computers: Programming: Languages: Python: Books']

Why is a comma (,) added between the elements? Also, all the ':' characters are removed.

Moreover, is this pure Python syntax?

Upvotes: 1

Views: 1770

Answers (2)

e h

Reputation: 8893

This is a regular expression (regex), and is a whole world unto itself.

(\w+): matches each run of word characters that is immediately followed by a colon, and returns only the word characters. The colon sits outside the parentheses, so it is matched but not captured, which is why the ":" gets removed.

(\w+:) matches the same text but returns the colon as well, because the colon is now inside the parentheses, so the ":" stays in.

Also, if you want to learn about regex, Codecademy has a good Python course.
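To see the difference between the two patterns, here is a minimal sketch using Python's standard re module on the title string from the question. It assumes that Scrapy's .re() behaves roughly like re.findall applied to each extracted string; that is an approximation for illustration, not a statement about Scrapy's internals.

```python
import re

# Title text taken from the extract() output shown in the question.
title = "Open Directory - Computers: Programming: Languages: Python: Books"

# Colon outside the parentheses: it must be present, but it is not captured.
without_colon = re.findall(r"(\w+):", title)

# Colon inside the parentheses: it becomes part of each captured result.
with_colon = re.findall(r"(\w+:)", title)

print(without_colon)  # ['Computers', 'Programming', 'Languages', 'Python']
print(with_colon)     # ['Computers:', 'Programming:', 'Languages:', 'Python:']
```

"Books" is missing from both results because no colon follows it, and "Open" and "Directory" are missing for the same reason.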

Upvotes: 2

thefourtheye

Reputation: 239563

(\w+):

is a regular expression. It matches any word that ends with : and captures the word characters ([a-zA-Z0-9_]) in a group.

The output does not contain :, because this method returns only the captured groups, and the colon is outside the parentheses.

The results are returned as a Python list. When a list is represented as a string, the elements are separated by ,.
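As a small illustration of that point, the sketch below builds the same list by hand (the variable names are made up for the example) and shows that the commas come from how Python prints a list, not from the data. On Python 2, which Scrapy 0.20 runs on, each element would also carry the u prefix seen in the question.

```python
# The four matches, written out as a plain Python list.
words = ["Computers", "Programming", "Languages", "Python"]

# repr() is what the interactive shell shows: brackets, quotes, and commas.
as_text = repr(words)

# Joining the elements yourself gives a string with no commas at all.
joined = " ".join(words)

print(as_text)  # ['Computers', 'Programming', 'Languages', 'Python']
print(joined)   # Computers Programming Languages Python
```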

\w is shorthand for [a-zA-Z0-9_]

Quoting from Python Regular Expressions Page,

\w

When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
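A quick way to check that \w includes digits is to compare it with a letters-and-underscore class on a string that contains a number. The sample string below is invented for the illustration.

```python
import re

# Hypothetical sample: a word containing digits, followed by a colon.
sample = "item_42: ok"

# \w covers letters, digits, and the underscore, so the whole word is captured.
with_digits = re.findall(r"(\w+):", sample)

# Without digits in the class, nothing sits directly before the colon.
letters_only = re.findall(r"([a-zA-Z_]+):", sample)

print(with_digits)   # ['item_42']
print(letters_only)  # []
```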

Upvotes: 1
