BBedit

Reputation: 8067

How to manipulate a URL string in order to extract a single piece?

I'm new to programming and Python.

Background

My program accepts a url. I want to extract the username from the url.

The username is the subdomain. If the subdomain is 'www', the username should be the main part of the domain. The rest of the domain should be discarded (e.g. '.com/', '.org/').

I've tried the following:

def get_username_from_url(url):
    if url.startswith(r'http://www.'):
        user = url.replace(r'http://www.', '', 1)
        user = user.split('.')[0]
        return user
    elif url.startswith(r'http://'):
        user = url.replace(r'http://', '', 1)
        user = user.split('.')[0]
        return user

easy_url = "http://www.httpwwwweirdusername.com/"    
hard_url = "http://httpwwwweirdusername.blogger.com/"

print get_username_from_url(easy_url)
# output = httpwwwweirdusername (good! expected.)

print get_username_from_url(hard_url)
# output = weirdusername (bad! username should = httpwwwweirdusername)

I've tried many other combinations using strip(), split(), and replace().

Could you advise me on how to solve this relatively simple problem?

Upvotes: 0

Views: 63

Answers (2)

user3960432

Reputation:

This can be done with a regular expression (the pattern below could probably be made more accurate or efficient).

import re
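# Match the scheme, an optional literal "www." (dot escaped), then capture the first hostname label.
url_pattern = re.compile(r'\w+://(?:www\.)?(\w+)')
def get_username_from_url(url):
    match = url_pattern.match(url)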
    if match:
        return match.group(1)

easy_url = "http://www.httpwwwweirdusername.com/"
hard_url = "http://httpwwwweirdusername.blogger.com/"

print get_username_from_url(easy_url)
print get_username_from_url(hard_url)

This yields:

httpwwwweirdusername
httpwwwweirdusername

Upvotes: 0

alecxe

Reputation: 474281

There is a module called urlparse that is designed specifically for this task (in Python 3 it lives in urllib.parse):

>>> from urlparse import urlparse
>>> url = "http://httpwwwweirdusername.blogger.com/"
>>> urlparse(url).hostname.split('.')[0]
'httpwwwweirdusername'

For http://www.httpwwwweirdusername.com/ this would output www, which is not desired. One workaround is to take the first item of the split hostname that is not equal to www:

>>> from urlparse import urlparse

>>> url = "http://www.httpwwwweirdusername.com/"
>>> next(item for item in urlparse(url).hostname.split('.') if item != 'www')
'httpwwwweirdusername'

>>> url = "http://httpwwwweirdusername.blogger.com/"
>>> next(item for item in urlparse(url).hostname.split('.') if item != 'www')
'httpwwwweirdusername'
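
The same idea can be wrapped in a function. This is only a minimal sketch, assuming Python 3 (where urlparse is imported from urllib.parse) and reusing the function name from the question. Returning None for a URL without a hostname, or one whose labels are all www, is my own assumption and not something the question specifies:

```python
from urllib.parse import urlparse  # Python 3 location of urlparse


def get_username_from_url(url):
    """Return the first hostname label that is not 'www', or None."""
    # hostname is None when the URL has no network location (e.g. no scheme).
    hostname = urlparse(url).hostname
    if hostname is None:
        return None
    # The default of None avoids StopIteration if every label is 'www'.
    return next((label for label in hostname.split('.') if label != 'www'), None)


print(get_username_from_url("http://www.httpwwwweirdusername.com/"))
print(get_username_from_url("http://httpwwwweirdusername.blogger.com/"))
```

Both calls should print httpwwwweirdusername, matching the interactive session above.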

Upvotes: 4
