Regular expression to extract pages beginning with the same URL

Question

I would like to write a regular expression that lets me extract pages whose URLs begin with the same prefix.

For example, I have the following URL:


https://www.afp.com/fr/infos/334/soudan-le-president-dechu-en-prison-les-manifestants-toujours-mobilises-doc-1fp9z64

And I want to keep only the URLs that begin with:

https://www.afp.com/fr/infos/334/

so that I end up with:

https://www.afp.com/fr/infos/334/le barça-est-gagnant
https://www.afp.com/fr/infos/334/mort au Zimbabwe
https://www.afp.com/fr/infos/334/le président français

So I tried:

https://www.afp.com/fr/infos/334/*
https://www.afp.com/fr/infos/334/[^abc]*

Neither works. I have to put the regular expression into a piece of software that does the crawling; the software is written in Python.

Oliver H. D. · Accepted Answer

I would just use something like:

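# Collect the URLs that contain the prefix.
# (Named matching_urls rather than list, so the built-in list is not shadowed.)
matching_urls = []

my_str = "https://www.afp.com/fr/infos/334/soudan-le-president-dechu-en-prison-les-manifestants-toujours-mobilises-doc-1fp9z64"
if "https://www.afp.com/fr/infos/334/" in my_str:
    matching_urls.append(my_str)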

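or use url.startswith(), as the other commenter recommended. Unlike the "in" test, which matches the prefix anywhere in the string, startswith() only matches at the beginning.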
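
If the crawler really does need a regular expression, here is a minimal sketch of both approaches side by side. The sample list `urls` and the names `PREFIX`, `LITERAL_PATTERN`, `by_prefix` and `by_regex` are made up for illustration, and how your crawling software applies the pattern (search, match from the start, or match the whole URL) is an assumption you should check in its documentation. As for the two attempts in the question: in a regex, `/*` means "zero or more slashes" and `[^abc]*` means "zero or more characters other than a, b or c", so neither stands for "anything after the prefix"; `.*` does. The dots in the host name should also be escaped, since an unescaped `.` matches any character.

```python
import re

PREFIX = "https://www.afp.com/fr/infos/334/"

# Hypothetical sample input; in a crawler these would be the discovered links.
urls = [
    "https://www.afp.com/fr/infos/334/soudan-le-president-dechu-en-prison-les-manifestants-toujours-mobilises-doc-1fp9z64",
    "https://www.afp.com/fr/infos/335/another-article",
    "https://www.afp.com/en/news/334/something-else",
]

# Approach 1: plain string test, no regex needed.
by_prefix = [u for u in urls if u.startswith(PREFIX)]

# Approach 2: regex built from the prefix.
# re.escape() makes the dots literal; ".*" accepts whatever follows the prefix.
pattern = re.compile(re.escape(PREFIX) + r".*")
by_regex = [u for u in urls if pattern.fullmatch(u)]

# The same pattern written out by hand, e.g. for pasting into a config field.
LITERAL_PATTERN = r"^https://www\.afp\.com/fr/infos/334/.*"
by_literal = [u for u in urls if re.match(LITERAL_PATTERN, u)]
```

All three lists should end up holding only the first URL. If the prefix is fixed and you control the Python code, startswith() is probably the simplest choice; the regex form is mainly useful when the tool only accepts a pattern.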
