siryx

Reputation: 125

Fast way to extract a list of URLs and check validity

I am working on a chat bot. I want it to post matching data from an API whenever a link to a gallery on an imageboard is posted. The gallery link looks like this

https://example.com/a/1234/a6fb1049/

where 1234 is a positive number (the id) and a6fb1049 is a hexadecimal string of fixed length 10 (the token). Right now I can only process messages that start with a gallery link.

if message_object.content.startswith("https://example.com/a/"):

I am looking for a fast way to process the message string, because every time a message is sent this will be invoked.

if message_object.content.startswith("https://example.org/a/"):
    temp = message_object.content.split("/")

    # Check that the link is actually a valid gallery link
    if temp[2] == "example.org" and temp[3] == "a" and 0 < int(temp[4]) and len(temp[5]) == 10:
        gallery_id = temp[4]
        gallery_token = temp[5]

        # Pass headers by keyword: the third positional argument of requests.post is `json`
        response = requests.post(url, data=payload, headers=json_request_headers)

I thought about using urllib.parse.urlparse and posixpath.split to split the string and checking the different substrings, but I feel like this is inefficient.
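For reference, the urllib.parse route mentioned above could look roughly like the sketch below. The function name parse_gallery_link is hypothetical, and the hexadecimal check on the token is an assumption taken from the description of the link format (the code above only checks the length).

```python
from urllib.parse import urlparse

HEX_DIGITS = set("0123456789abcdef")


def parse_gallery_link(link):
    """Return (gallery_id, gallery_token), or None if the link is not a gallery link."""
    parts = urlparse(link)
    if parts.scheme != "https" or parts.netloc != "example.org":
        return None

    # "/a/1234/a6fb1049ab/" -> ["a", "1234", "a6fb1049ab"]
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 3 or segments[0] != "a":
        return None

    gallery_id, gallery_token = segments[1], segments[2]

    # The id must be a positive decimal number
    if not (gallery_id.isascii() and gallery_id.isdigit()) or int(gallery_id) <= 0:
        return None

    # The token must be exactly 10 lowercase hexadecimal characters
    if len(gallery_token) != 10 or not set(gallery_token) <= HEX_DIGITS:
        return None

    return gallery_id, gallery_token
```

This only validates a single link that has already been isolated; it does not find links inside a longer message.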

Because I am really not good with Regex, this is all I came up with.

searchObj = re.search( r'https://example.org/a/(.*)/(.*)/', message)

This works when the message contains exactly one link, but as soon as there are two links it fails.

I would rather collect all matching links from the message in a list, then iterate over the list, check the page header to see whether each link is valid, and create an API request to retrieve the data.

The regular expressions for matching URLs on Stack Overflow don't show how to match only such specific cases, so I am sorry if this is a newbie question.

Upvotes: 0

Views: 134

Answers (1)

Casimir et Hippolyte

Reputation: 89574

I don't understand why you wrote https://example.org/a/(.*)/(.*)/ when you know precisely that "1234 is a positive number (id) and a6fb1049 is a hexadecimal String of fixed length 10" (or perhaps 8, since the example token has 8 characters). Translating this sentence into a pattern is easy and needs only basic notions:

re.findall(r'(https://example.org/a/([0-9]+)/([0-9a-f]{10})/)', message)

re.findall is the method to use to get several results (re.search returns only the first result; see the re module manual).

You obtain a list of tuples, where each tuple contains the parts matched by the round brackets (capture groups); feel free to put them where you want.
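As an illustration, here is a minimal sketch of how the result can be iterated. The sample message and tokens are made up, the name GALLERY_RE is hypothetical, and the dot in the host name is escaped as a small tightening of the pattern; compiling the pattern once is optional but convenient when the check runs on every message.

```python
import re

# Compile once and reuse for every incoming message
GALLERY_RE = re.compile(r"(https://example\.org/a/([0-9]+)/([0-9a-f]{10})/)")

# Made-up sample message containing two gallery links
message = (
    "first https://example.org/a/1234/a6fb1049ab/ "
    "second https://example.org/a/77/0123456789/ end"
)

# One tuple per link: (full link, id, token)
matches = GALLERY_RE.findall(message)

for link, gallery_id, gallery_token in matches:
    print(link, gallery_id, gallery_token)
```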

If you want to know if there are links that don't match the format you want, you can also use something like this:

re.findall(r'(https://example.org/a/(?:([0-9]+)/([0-9a-f]{10})/|\S*))', message)

Then you only have to test whether group 2 is empty to know if a link has the right format (re.findall returns an empty string, not None, for a group that did not take part in the match).
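A possible way to use this is sketched below. It assumes a variant of the pattern in which the fallback branch is \S* (so a malformed link stops at the next whitespace) and the alternation sits inside the non-capturing part after the common prefix; the names LINK_RE and split_links are hypothetical.

```python
import re

# Either a well-formed "<id>/<token>/" tail, or whatever non-space text follows the prefix
LINK_RE = re.compile(
    r"(https://example\.org/a/(?:([0-9]+)/([0-9a-f]{10})/|\S*))"
)


def split_links(message):
    """Return (valid, malformed): valid holds (id, token) pairs, malformed holds raw links."""
    valid, malformed = [], []
    for link, gallery_id, gallery_token in LINK_RE.findall(message):
        if gallery_id:
            # Groups 2 and 3 took part in the match, so the format is right
            valid.append((gallery_id, gallery_token))
        else:
            # The fallback branch matched: groups 2 and 3 are empty strings
            malformed.append(link)
    return valid, malformed
```

For example, split_links("ok https://example.org/a/1234/a6fb1049ab/ bad https://example.org/a/oops/xyz") puts the first link in valid and the second in malformed.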

Upvotes: 1
