Reputation: 595
I would like to write a regular expression that lets me extract pages whose URLs begin with the same prefix.
For example, I have the following URL:
https://www.afp.com/fr/infos/334/soudan-le-president-dechu-en-prison-les-manifestants-toujours-mobilises-doc-1fp9z64
and I want to keep only the URLs that begin with:
https://www.afp.com/fr/infos/334/
so that I end up with, for example:
https://www.afp.com/fr/infos/334/le barça-est-gagnant
https://www.afp.com/fr/infos/334/mort au Zimbabwe
https://www.afp.com/fr/infos/334/le président français
So I tried:
https://www.afp.com/fr/infos/334/*
https://www.afp.com/fr/infos/334/[^abc]*
Neither of these works. I have to supply the regular expression to the software that does the crawling, and that software is written in Python.
Upvotes: 0
Views: 89
Reputation: 10090
You should just use str.startswith(), like this:
if url.startswith('https://www.afp.com/fr/infos/334/'):
    # do stuff with url
    ...
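To apply the same check to a whole batch of crawled URLs, something like the following sketch could work. The function name filter_by_prefix and the second and third sample URLs are made up for illustration; only the prefix and the first URL come from the question.

```python
PREFIX = "https://www.afp.com/fr/infos/334/"


def filter_by_prefix(urls, prefix=PREFIX):
    """Return only the URLs that start with the given prefix."""
    return [u for u in urls if u.startswith(prefix)]


urls = [
    # URL from the question
    "https://www.afp.com/fr/infos/334/soudan-le-president-dechu-en-prison-les-manifestants-toujours-mobilises-doc-1fp9z64",
    # hypothetical URLs that should be rejected
    "https://www.afp.com/fr/infos/335/autre-article",
    "https://www.afp.com/en/news/334/some-article",
]

matching = filter_by_prefix(urls)
for u in matching:
    print(u)
```

No regular expression is involved here, so there is nothing to escape and no special characters to worry about.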
Upvotes: 4
Reputation: 333
I would just use something like:
matching_urls = []
my_str = "https://www.afp.com/fr/infos/334/soudan-le-president-dechu-en-prison-les-manifestants-toujours-mobilises-doc-1fp9z64"
# note: "in" matches the prefix anywhere in the string, not only at the start
if "https://www.afp.com/fr/infos/334/" in my_str:
    matching_urls.append(my_str)
Or use url.startswith(), as the other answer recommends.
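If the crawler really does require a regular expression, a possible sketch is below. It assumes the crawler uses Python's re module and may require the pattern to match the whole URL; that is a guess, since the question does not name the software. In the patterns from the question, "/*" means "zero or more slashes" and "[^abc]*" means "any run of characters other than a, b or c", so neither covers the rest of the URL. The unescaped dots also match any character. The name matches_prefix is made up for illustration.

```python
import re

PREFIX = "https://www.afp.com/fr/infos/334/"

# re.escape makes the dots literal; ".*" then accepts whatever follows the prefix.
pattern = re.compile(re.escape(PREFIX) + r".*")


def matches_prefix(url):
    """Return True if the whole URL is the prefix followed by anything."""
    return pattern.fullmatch(url) is not None


url = "https://www.afp.com/fr/infos/334/soudan-le-president-dechu-en-prison-les-manifestants-toujours-mobilises-doc-1fp9z64"

print(matches_prefix(url))
print(matches_prefix("https://www.afp.com/fr/infos/335/autre-article"))

# The string to paste into the crawler's configuration, under the assumption above:
print(pattern.pattern)
```

If the crawler only searches for the pattern instead of matching the full URL, anchoring it with "^" and dropping the trailing ".*" should be enough.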
Upvotes: 3