Aquiles Páez

Reputation: 573

Python: trouble getting URL of href using BeautifulSoup

I'm learning web scraping in Python, starting with BeautifulSoup. I've run into an issue I'm not sure how to solve. Here is the relevant snippet of my code:

from bs4 import BeautifulSoup
import requests

start_url = "https://www1.interactivebrokers.com/en/index.php?f=2222&exch=nasdaq&showcategories=STK#productbuffer"

# Download the HTML from start_url:
downloaded_html = requests.get(start_url)

# Parse the HTML with BeautifulSoup and create a soup object
soup = BeautifulSoup(downloaded_html.text, "html.parser")
# Select table where the data is:
rawTable = soup.select('table.table.table-striped.table-bordered tbody')[2]
url = rawTable.find_all('a',{'class':'linkexternal'})
print(url[0])
print(url[0].get('href'))

The first print shows the link from the first row after the table header, which contains the company information (you can see it on the page linked above). The second print outputs only the href attribute, which is meant to open a pop-up page with further information. I'll paste it here:

javascript:NewWindow('https://contract.ibkr.info/index.php?action=Details&site=GEN&conid=48811132','Details','600','600','custom','front');

The actual URL looks like this when I click the link manually:

https://contract.ibkr.info/v3.10/index.php?action=Details&site=GEN&conid=48811132

Is there a BeautifulSoup method that can get this for me, or another Python module I can combine with BeautifulSoup to capture the URL of the pop-up? I would prefer not to use regular expressions for this.

Thanks in advance for any help.

Upvotes: 0

Views: 241

Answers (2)

yash jain

Reputation: 50

Behind the scenes, almost every package that extracts text patterns uses regular expressions, so I suggest using a regex:

https?:[^\s,'[\]();]+
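To show how that pattern could be applied, here is a minimal sketch using Python's standard `re` module on the href string from the question. The names `URL_PATTERN` and `popup_url` are only illustrative, and the pattern is copied as posted, so it may need adjusting for hrefs with a different shape.

```python
import re

# Pattern copied verbatim from the answer: match "http:" or "https:" followed
# by any run of characters that are not whitespace, commas, quotes, brackets,
# parentheses or semicolons.
URL_PATTERN = re.compile(r"https?:[^\s,'[\]();]+")

# href value taken from the question.
href = (
    "javascript:NewWindow('https://contract.ibkr.info/index.php"
    "?action=Details&site=GEN&conid=48811132',"
    "'Details','600','600','custom','front');"
)

# search() returns the first match, or None if the href contains no URL.
match = URL_PATTERN.search(href)
popup_url = match.group(0) if match else None
print(popup_url)
```

In the question's code, `href` would come from `url[0].get('href')` instead of a hard-coded string.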

Upvotes: 0

buran

Reputation: 14273

print(url[0].get('href').split("'")[1])

For example:

href = "javascript:NewWindow('https://contract.ibkr.info/index.php?action=Details&site=GEN&conid=48811132','Details','600','600','custom','front');"
print(href.split("'")[1])

Output:

https://contract.ibkr.info/index.php?action=Details&site=GEN&conid=48811132
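If some links in the table might not follow the `javascript:NewWindow('...')` form, indexing `[1]` directly could raise an `IndexError` or return something that is not a URL. A slightly more defensive variant of the same `split` idea might look like this. The helper name `extract_popup_url` is hypothetical, and the sketch assumes the URL is always the first single-quoted argument.

```python
def extract_popup_url(href):
    """Return the first single-quoted argument of the href if it looks
    like an http(s) URL, otherwise None."""
    parts = href.split("'")
    # A quoted argument produces at least three pieces:
    # text before the quote, the quoted text, and the remainder.
    if len(parts) < 3:
        return None
    candidate = parts[1]
    if candidate.startswith(("http://", "https://")):
        return candidate
    return None


# href value taken from the question.
href = (
    "javascript:NewWindow('https://contract.ibkr.info/index.php"
    "?action=Details&site=GEN&conid=48811132',"
    "'Details','600','600','custom','front');"
)

popup_url = extract_popup_url(href)
print(popup_url)
```

With the question's code this could be applied to every row, for example `[extract_popup_url(a.get('href', '')) for a in url]`.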

Upvotes: 1
