Reputation: 199

Get number sequence after a specific string in URL text

I'm writing a Python script to check a bunch of URLs and extract their ID text. The URLs follow this pattern:

http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
Up to
http://XXXXXXX.XXX/index.php?id=YYYYYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX

What I'm trying to do is get only the numbers after id= and before the &.

I've tried the regex (\D+)(\d+), but it also matches the numbers in auth.

Any suggestions on how to get only the id sequence?

Upvotes: 1

Answers (6)

Gustav

Reputation: 67

Use the regex id=[0-9]+:

import re
pattern = "id=[0-9]+"
id_value = re.findall(pattern, url)[0].split("id=")[1]  # url holds the URL string

This way, &auth does not need to follow the id, so the pattern also works when id is the last parameter. If &auth is present, the code still works, so both cases are handled.
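
A capture group can avoid the extra split. This is only a minimal sketch, assuming the id is purely numeric; the sample URL and the helper name extract_id are made up for illustration:

```python
import re

# Hypothetical sample URL; the real ones follow the same shape.
url = "http://example.com/index.php?id=12345&auth=abc123def456"

def extract_id(url):
    """Return the digits following 'id=' as a string, or None if not found."""
    # [?&] makes sure we match the 'id' parameter and not e.g. 'uid'.
    match = re.search(r"[?&]id=([0-9]+)", url)
    return match.group(1) if match else None

print(extract_id(url))  # 12345
```

Returning None instead of indexing [0] keeps the script from raising an IndexError on a URL that has no id.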

Upvotes: 0

PythonProgrammi

Reputation: 23463

variables = """http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
    http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
    http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX""".splitlines()

for v in variables:
    p1 = v.split("id=")[1]
    p2 = p1.split("&")[0]
    print(p2)

Output:

YY
YYY
YYYY

If you prefer regex:

import re

variables = """http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"""

pattern = r"id=(.*)&"
x = re.findall(pattern, variables)
print(x)

Output:

['YY', 'YYY', 'YYYY']

I'm not sure whether "only the numbers after id= and before &" means the id can contain both letters and digits. If so, you can extract the id first and then strip the non-digit characters:

import re


variables = """http://XXXXXXX.XXX/index.php?id=5Y44Y&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=Y2242YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=5YY453YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"""

pattern = r"id=(.*)&"
x = re.findall(pattern, variables)
print(x)

x2 = []
for p in x:
    x2.append(re.sub("\\D", "", p))
print(x2)

Output:

['5Y44Y', 'Y2242YY', '5YY453YY']
['544', '2242', '5453']

Upvotes: 0

xana

Reputation: 499

These are URLs, so I would just use a URL parser here.

Look at urllib.parse.

Use urlparse to get the query string, then parse_qs to turn it into a dict of parameters.

import urllib.parse as p
url = "http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"
query = p.urlparse(url).query
params = p.parse_qs(query)
print(params['id'])  # parse_qs returns a list per key: ['YY']
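
Since parse_qs maps each key to a list of values, a small helper can pull out the single id and cope with URLs that lack one. A rough sketch; the URLs and the name get_id are placeholders, not from the question:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical sample URLs for illustration.
urls = [
    "http://example.com/index.php?id=12&auth=abc123def456",
    "http://example.com/index.php?id=1234567&auth=abc123def456",
    "http://example.com/index.php?auth=abc123def456",  # no id at all
]

def get_id(url):
    """Return the first 'id' query value, or None if the URL has none."""
    params = parse_qs(urlparse(url).query)
    values = params.get("id")
    return values[0] if values else None

ids = [get_id(u) for u in urls]
print(ids)  # ['12', '1234567', None]
```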

Upvotes: 2

Leo Arad

Reputation: 4472

You can try this regex:

import re

urls = ["http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX", "http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX", "http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"]
for url in urls:
    id_value = re.search(r"id=(.*)(?=&)", url).group(1)
    print(id_value)

That will get you the id value from each URL:

YY
YYY
YYYY
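
One thing to watch: a greedy (.*) runs up to the last & in the URL, while a lazy (.*?) stops at the first one. The difference only shows when more parameters follow auth, which is an assumption here since the question's URLs have a single &. A quick illustration with a made-up URL:

```python
import re

# Hypothetical URL with an extra parameter after auth.
url = "http://example.com/index.php?id=42&auth=abc123&page=2"

# Greedy: captures everything up to the LAST '&'.
greedy = re.search(r"id=(.*)(?=&)", url).group(1)
# Lazy: captures as little as possible, stopping before the FIRST '&'.
lazy = re.search(r"id=(.*?)(?=&)", url).group(1)

print(greedy)  # 42&auth=abc123
print(lazy)    # 42
```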

Upvotes: 0

Code Pope

Reputation: 5459

Another way is to use split:

string = 'http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX'
print(string.split('id=')[1].split('&auth=')[0])

Output:

YY

Upvotes: 2

a_guest

Reputation: 36329

You can include the start and stop tokens in the regex:

pattern = r'id=(\d+)(?:&|$)'
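
A short usage sketch for this pattern. The URLs are invented for illustration; the third one shows that an id containing letters is rejected rather than partially matched:

```python
import re

pattern = r'id=(\d+)(?:&|$)'

# Hypothetical sample URLs for illustration.
urls = [
    "http://example.com/index.php?id=12&auth=abc123def456",
    "http://example.com/index.php?id=1234567",               # id is the last parameter
    "http://example.com/index.php?id=12ab&auth=abc123def456",  # not purely numeric
]

results = []
for url in urls:
    match = re.search(pattern, url)
    # group(1) holds only the digits; the '&' or end-of-string is not captured.
    results.append(match.group(1) if match else None)

print(results)  # ['12', '1234567', None]
```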

Upvotes: 0
