Maiko Ohkawa
Maiko Ohkawa

Reputation: 933

Can I create list from regular expressions?

I'm making a crawler. User can specify regular expression string to download data.

When user input form is:

http://xxx/abc[x-z]/image(9|10|11).png

I want to download these.

http://xxx/abcx/image9.png
http://xxx/abcy/image9.png
http://xxx/abcz/image9.png
http://xxx/abcx/image10.png
http://xxx/abcy/image10.png
http://xxx/abcz/image10.png
http://xxx/abcx/image11.png
http://xxx/abcy/image11.png
http://xxx/abcz/image11.png

Can I create the following list from the above regular expression string? Or, can I use each string in for-in block?

Upvotes: 5

Views: 7535

Answers (3)

MrAlexBailey
MrAlexBailey

Reputation: 5289

If you are wanting to take a user's given regex as an input and generate a list of strings you can use the library sre_yield:

However, be very aware that trying to parse every possible string of a regex can get out of hand very quickly. You'll need to be sure that your users are aware of the implications that wildcard characters and open ended or repeating groups can have on the number of possible matching strings.

As an example, your regex string: http://xxx/abc[x-z]/image(9|10|11).png does not escape the ., which is a wildcard for any character, so it will generate a lot of unexpected strings. Instead we'll need to escape it as seen in the example below:

>>> import sre_yield

>>> links = []

>>> for each in sre_yield.AllStrings(r'http://xxx/abc[x-z]/image(9|10|11)\.png'):
        links.append(each)

Or more simply links = list(sre_yield.AllStrings(r'http://xxx/abc[x-z]/image(9|10|11)\.png'))

The result is:

>>> links

['http://xxx/abcx/image9.png', 'http://xxx/abcy/image9.png', 
'http://xxx/abcz/image9.png', 'http://xxx/abcx/image10.png', 
'http://xxx/abcy/image10.png', 'http://xxx/abcz/image10.png', 
'http://xxx/abcx/image11.png', 'http://xxx/abcy/image11.png', 
'http://xxx/abcz/image11.png']

Upvotes: 10

Learner
Learner

Reputation: 5292

Simple try may be-alternative to the previous answers

lst = ['http://xxx/abc%s/image%s.png'%(x,y) for x, y in [(j,i) for i in (9,10,11) for j in ('x', 'y', 'z')]]

Omitted range and format function for quicker performance.


Analysis- I compared my way and the way posted by Jkdc

I ran both way 100000 times but mean shows that itertools approach is faster in terms of execution time-

from itertools import product
import time
from matplotlib import pyplot as plt
import numpy as np

prodct = []
native = []

def test():
    start = time.clock()
    lst = ['http://xxx/abc{}/image{}'.format(x, y) for x, y in product(('x', 'y', 'z'), range(9, 11))]
    end = time.clock()

    print '{0:.50f}'.format(end-start)
    prodct.append('{0:.50f}'.format(end-start))

    start1 = time.clock()
    lst = ['http://xxx/abc%s/image%s'%(x,y) for x, y in [(j,i) for i in (9,10,11) for j in ('x', 'y', 'z')]]
    end1 = time.clock()

    print '{0:.50f}'.format(end1-start1)
    native.append('{0:.50f}'.format(end1-start1))


for i in range(1,100000):
    test()


y = np.dot(np.array(native).astype(np.float),100000)
x= np.dot(np.array(prodct).astype(np.float),100000)

print np.mean(y)
print np.mean(x)

graph and getting result for native(no module) and itertools-product as below

for native 2.1831179834 for itertools-product 1.60410432562

Upvotes: 0

MrAlexBailey
MrAlexBailey

Reputation: 5289

You can use product() from the itertools builtin:

from itertools import product

for x, y in product(['x', 'y', 'z'], range(9, 12)):
    print 'http://xxx/abc{}/image{}'.format(x, y)

To build your list you can use a comprehension:

links = ['http://xxx/abc{}/image{}'.format(x, y) for x, y in product(['x', 'y', 'z'], range(9, 12))]

Upvotes: 1

Related Questions