Jake Rankin

Reputation: 744

Unable to remove leading space from result of parsing URL with BeautifulSoup

I am dealing with a dictionary that contains a lot of HTML links with incomplete URLs, in the form:

<li><b>Random Thing</b>: <a href="dl_img/CM2233.jpg" target=_blank>JPG</a></li>

I am using BeautifulSoup to extract just the relative URL and prepend the domain to it to get a complete URL. BeautifulSoup works well, but the printed result has a space between the domain and the link. I am trying to use lstrip to remove this, but it has no effect.

I am using the following code:

for datadict in temp:
    temp1 = svc.call(session, 'catalog_product.info', [datadict['product_id']])
    imagehtml = temp1['dl_image']
    if temp1.get('set') is not None:
        if imagehtml is not None and imagehtml != '':
            soup = Soup(imagehtml, 'html.parser')
            for a in soup.find_all('a', href=True):
                print("www.example.com/media/", a['href'].lstrip())

Which outputs the following:

www.example.com/media/ dl_img/CM2233.jpg

What other techniques can I use to remove the whitespace at the beginning of what BeautifulSoup returns?

Upvotes: 2

Views: 203

Answers (1)

alecxe

Reputation: 473863

The space you see does not come from BeautifulSoup. It is the default separator that print() inserts between its arguments when more than one is passed (sep=' '), which is why lstrip() on the href has no effect. You can change this separator if needed:

print("www.example.com/media/", a['href'], sep='')
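
To make the cause visible, here is a minimal sketch that captures what print() writes in each case. The helper name `printed` is hypothetical and exists only for this illustration; the href value is the one from the question's example link.

```python
import io

# Value taken from the example link in the question.
href = "dl_img/CM2233.jpg"

def printed(*args, **kwargs):
    """Return exactly what print() would write for these arguments."""
    buf = io.StringIO()
    print(*args, file=buf, **kwargs)
    return buf.getvalue()

# Two arguments, default separator: print() puts a single space between them.
default_out = printed("www.example.com/media/", href)

# Same arguments with an empty separator: no space is inserted.
no_sep_out = printed("www.example.com/media/", href, sep="")

# Building one string first also avoids the separator entirely.
concat_out = printed("www.example.com/media/" + href)

# The href never had leading whitespace, so lstrip() returns it unchanged.
href_unchanged = href.lstrip() == href
```

Here `default_out` contains the unwanted space, while `no_sep_out` and `concat_out` do not. The href itself is identical before and after `lstrip()`, so the stripping was never the problem.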

In general though, you could and should use urllib.parse.urljoin() to join parts of a URL:

from urllib.parse import urljoin

base_url = "www.example.com/media/"

for a in soup.find_all('a', href=True):
    print(urljoin(base_url, a['href']))
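
As a rough illustration of how urljoin() resolves a relative href, the sketch below uses a base URL with an explicit scheme. The `https://` scheme is an assumption added for this example and is not part of the original snippet; in practice you would normally want the scheme in the final URL anyway.

```python
from urllib.parse import urljoin

# Assumption: the site is served over https; the question only gives the host and path.
base_url = "https://www.example.com/media/"
href = "dl_img/CM2233.jpg"  # relative href from the question's example

# Relative path resolved against a base that ends with a slash.
full = urljoin(base_url, href)

# Without the trailing slash, the last path segment ("media") is replaced.
no_slash = urljoin("https://www.example.com/media", href)

# An href starting with "/" is resolved from the site root, not from /media/.
rooted = urljoin(base_url, "/" + href)

print(full)      # https://www.example.com/media/dl_img/CM2233.jpg
print(no_slash)  # https://www.example.com/dl_img/CM2233.jpg
print(rooted)    # https://www.example.com/dl_img/CM2233.jpg
```

The trailing slash on the base matters: with it, the href is appended under /media/; without it, "media" is treated as a file name and dropped.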

Upvotes: 2
