kely789456123

Reputation: 595

Regular expression to extract pages beginning with the same URL

I would like to write a regular expression that lets me extract pages whose URLs begin with the same prefix.

For example, I have the following URL:


https://www.afp.com/fr/infos/334/soudan-le-president-dechu-en-prison-les-manifestants-toujours-mobilises-doc-1fp9z64

I only want the URLs that begin with:

https://www.afp.com/fr/infos/334/

so that I get:

https://www.afp.com/fr/infos/334/le barça-est-gagnant
https://www.afp.com/fr/infos/334/mort au Zimbabwe
https://www.afp.com/fr/infos/334/le président français


So I tried

https://www.afp.com/fr/infos/334/*
https://www.afp.com/fr/infos/334/[^abc]*

Neither works. I have to put the regular expression into crawling software, which is written in Python.

Upvotes: 0

Views: 89

Answers (2)

wpercy

Reputation: 10090

You should just use str.startswith(), like this:

if url.startswith('https://www.afp.com/fr/infos/334/'):
    # do stuff with url
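
If the crawler only accepts a regular expression, here is a minimal sketch of the same prefix test written as a pattern. It assumes the tool uses Python's re module and may require the whole URL to match, which I can't confirm for your software. In the first attempt the trailing /* means "zero or more slashes", and [^abc]* in the second only excludes the letters a, b and c, so neither says "anything after the prefix". The names PREFIX, PATTERN and has_prefix, and the second URL in the list, are made up for illustration.

```python
import re

PREFIX = "https://www.afp.com/fr/infos/334/"

# re.escape makes the dots in the prefix literal;
# ".*" then accepts any characters after the prefix.
PATTERN = re.compile(re.escape(PREFIX) + r".*")


def has_prefix(url):
    """Return True if the whole URL matches PREFIX followed by anything."""
    return PATTERN.fullmatch(url) is not None


urls = [
    "https://www.afp.com/fr/infos/334/soudan-le-president-dechu-en-prison-les-manifestants-toujours-mobilises-doc-1fp9z64",
    "https://www.afp.com/fr/infos/335/autre-page",  # hypothetical URL outside the section
]

# Keep only the URLs that start with the prefix.
matching = [u for u in urls if has_prefix(u)]
```

If the software takes the pattern as a plain string, PATTERN.pattern gives the escaped text to paste in.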

Upvotes: 4

Oliver H. D.

Reputation: 333

I would just use something like:

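matching_urls = []  # not named "list", which would shadow the built-in type

url = "https://www.afp.com/fr/infos/334/soudan-le-president-dechu-en-prison-les-manifestants-toujours-mobilises-doc-1fp9z64"
# Note: "in" finds the prefix anywhere in the string, not only at the start.
if "https://www.afp.com/fr/infos/334/" in url:
    matching_urls.append(url)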

Or use url.startswith(), as the other answer recommends.
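
The two checks are not quite equivalent, which may matter for a crawler: a substring test also accepts URLs that merely contain the prefix somewhere. A small sketch of the difference, where the second candidate URL is invented for illustration:

```python
prefix = "https://www.afp.com/fr/infos/334/"

candidates = [
    "https://www.afp.com/fr/infos/334/soudan-le-president-dechu-en-prison-les-manifestants-toujours-mobilises-doc-1fp9z64",
    # Hypothetical URL that contains the prefix but does not start with it.
    "https://example.com/redirect?to=https://www.afp.com/fr/infos/334/x",
]

# Substring test: true wherever the prefix occurs in the string.
by_substring = [u for u in candidates if prefix in u]

# Prefix test: true only when the string begins with the prefix.
by_prefix = [u for u in candidates if u.startswith(prefix)]
```

Here by_substring keeps both URLs, while by_prefix keeps only the first one.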

Upvotes: 3
