Reputation: 744
I am dealing with a dictionary that contains a lot of HTML links with incomplete URLs, in the form:
<li><b>Random Thing</b>: <a href="dl_img/CM2233.jpg" target=_blank>JPG</a></li>
I am using BeautifulSoup to extract just the URL and append it to the domain to build a complete URL. BeautifulSoup works well, but the returned string has a space at the start of the link. I am trying to use lstrip() to remove it, but it has no effect.
I am using the following code:
for datadict in temp:
    temp1 = svc.call(session, 'catalog_product.info', [datadict['product_id']])
    imagehtml = temp1['dl_image']
    if temp1.get('set') != None:
        if imagehtml != None and imagehtml != '':
            soup = Soup(imagehtml, 'html.parser')
            for a in soup.find_all('a', href=True):
                print("www.example.com/media/", a['href'].lstrip())
Which outputs the following:
www.example.com/media/ dl_img/CM2233.jpg
What other techniques can I use to remove the whitespace at the beginning of what BeautifulSoup returns?
Upvotes: 2
Views: 203
Reputation: 473863
The space you see is just the default separator that print() inserts between multiple arguments; it is not part of the string BeautifulSoup returns, which is why lstrip() has no effect. You can change the separator if needed:
print("www.example.com/media/", a['href'], sep='')
In general though, you could (and should) use urllib.parse.urljoin() to join the parts of a URL:
from urllib.parse import urljoin

base_url = "www.example.com/media/"
for a in soup.find_all('a', href=True):
    print(urljoin(base_url, a['href']))
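Note that urljoin() follows the standard relative-URL resolution rules, so the base is best given with a scheme. A minimal sketch of how it resolves relative hrefs (the https:// scheme is an assumption, since the question only shows the bare domain):

from urllib.parse import urljoin

base_url = "https://www.example.com/media/"      # assumed scheme; the question only shows the bare domain

print(urljoin(base_url, "dl_img/CM2233.jpg"))    # https://www.example.com/media/dl_img/CM2233.jpg
print(urljoin(base_url, "/dl_img/CM2233.jpg"))   # https://www.example.com/dl_img/CM2233.jpg (leading slash resets the path)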
Upvotes: 2