Reputation: 121
I have a string:
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
I want to get a result like this:
[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]
I tried this:
match = re.findall("([fh]t*ps?|file):[\\/]*(.*?)(:\d+|(?=[\\\/]))", line)
And then I got:
[["https", "dbwebb.se", ""], ["ftp", "bth.com", ":32"], ["file", "localhost", ":8585"], ["http", "v2-dbwebb.se", ""]]
There is one difference: you can see ":32" and ":8585". How can I get just "32" and "8585", without the ":"? Thanks!
Upvotes: 0
Views: 76
Reputation: 626802
I suggest
import re
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
match = re.findall(r"([fh]t*ps?|file)://([^/]*?)(?::(\d+))?(?:/|$)", line)
print(match)
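The main point is the (?::(\d+))?(?:/|$) part, where the : and 1+ digits part is optional ((?:...)? matches 1 or 0 times) and (?:/|$) matches a / or the end of the string. Because the colon sits inside the non-capturing group and only \d+ is captured, the third item of each tuple holds the digits without the :.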
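To make the role of each group easier to see, the same pattern can be written in verbose mode. This is only a sketch of the regex above, not a different approach; the name URL_PARTS is an illustrative choice and not part of the original answer.

```python
import re

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

# Same pattern as above, annotated part by part (URL_PARTS is a hypothetical name).
URL_PARTS = re.compile(r"""
    ([fh]t*ps?|file)   # group 1: the scheme
    ://                # literal separator
    ([^/]*?)           # group 2: the host, as few non-slash chars as possible
    (?: : (\d+) )?     # optional non-capturing group; group 3 captures only the digits
    (?: / | $ )        # a slash or the end of the string must follow
""", re.VERBOSE)

# findall returns one tuple per match; an unmatched optional group becomes ''.
matches = URL_PARTS.findall(line)
print(matches)
```

The colon is matched but never captured, which is why the ports come back as '32' and '8585'.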
Details
([fh]t*ps?|file) - Group 1 (the first item in the tuple): either of
    [fh]t*ps? - f or h, zero or more t, p, and 1 or 0 s
    | - or
    file - a literal file substring
:// - a literal substring
([^/]*?) - Group 2 (the second item in the tuple): any 0 or more chars other than /, as few as possible
(?::(\d+))? - an optional sequence of:
    : - a colon
    (\d+) - Group 3 (the third item in the tuple): one or more digits
(?:/|$) - a / or end of string.
Upvotes: 1
Reputation: 142146
Instead of a regex, why not split on the comma and then use Python's urllib.parse.urlparse, e.g.:
from urllib.parse import urlparse
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
output = [urlparse(url) for url in line.split(', ')]
Gives you:
[ParseResult(scheme='https', netloc='dbwebb.se', path='/kunskap/uml', params='', query='', fragment='sequence'),
ParseResult(scheme='ftp', netloc='bth.com:32', path='/files/im.jpeg', params='', query='', fragment=''),
ParseResult(scheme='file', netloc='localhost:8585', path='/zipit', params='', query='', fragment=''),
ParseResult(scheme='http', netloc='v2-dbwebb.se', path='/do%hack', params='', query='', fragment='')]
Then filter out the elements you want:
wanted = [(url.scheme, url.hostname, url.port or '') for url in output]
Which gives you:
[('https', 'dbwebb.se', ''),
('ftp', 'bth.com', 32),
('file', 'localhost', 8585),
('http', 'v2-dbwebb.se', '')]
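Note that url.port is an int here, while the question asked for the port as a string. If that matters, a small variation could convert it explicitly. This is a minimal sketch; the helper name url_parts is made up for illustration, and it splits on a bare comma and strips whitespace so the spacing between URLs does not matter.

```python
from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

def url_parts(url):
    """Return (scheme, host, port) with the port as a string, '' when absent."""
    parsed = urlparse(url.strip())
    port = parsed.port  # an int, or None when the URL has no port
    # Test against None rather than truthiness so a port of 0 is not dropped.
    return (parsed.scheme, parsed.hostname, "" if port is None else str(port))

wanted = [url_parts(url) for url in line.split(",")]
print(wanted)
```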
Upvotes: 1
Reputation: 89557
Regex isn't the right tool for parsing URLs. There's a dedicated standard-library module for this complicated task, urllib.parse:
from urllib.parse import urlparse
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
result = []
for i in line.split(', '):
    o = urlparse(i)
    result.append([o.scheme, o.hostname, o.port])
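With this loop, o.port is None for URLs that have no port, and reading o.port raises ValueError when the port part is not a valid number. If the input might contain such URLs, a defensive variant could look like the sketch below. The helper name safe_parts and the example.com URL are assumptions added for illustration; they are not part of the original data.

```python
from urllib.parse import urlparse

def safe_parts(url):
    """Return [scheme, host, port]; port is None when missing or not a valid number."""
    o = urlparse(url.strip())
    try:
        port = o.port  # raises ValueError for a port such as ":abc"
    except ValueError:
        port = None
    return [o.scheme, o.hostname, port]

# The last URL is a made-up example with an invalid port.
line = "ftp://bth.com:32/files/im.jpeg, http://v2-dbwebb.se/do%hack, http://example.com:abc/x"
result = [safe_parts(u) for u in line.split(",")]
print(result)
```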
Upvotes: 1