Reputation: 13
Can I scrape a jQuery object? I want to get all the href links from the jQuery object. How can I achieve that? I started learning Python web scraping less than a week ago from YouTube and the Internet.
url_1='http://ws.bursamalaysia.com/market/listed-companies/company-announcements/announcements_listing_f.html?_=1449326650932&callback=jQuery16208050466175191104_1449326525662&page_category=company&category=FA&sub_category=all&all_gm=&alphabetical=All&board=&sector=&date_from=&date_to=&company=5218&page=&testing='
#Standard url request
req = urllib.request.Request(url_1, headers=headers)
resp = urllib.request.urlopen(req)
respData = resp.read()
soup = BeautifulSoup(respData, 'html.parser')
#soup.prettify()
pattern=re.compile("href")
links = soup.find_all(text=pattern)
print(links)
I still cannot get all the links. It returns many \\\n in place of \n. Why does this happen? Should I convert them into a string?
I tried using
links = soup.find_all('a')
print(links)
but it returns []. Why is that so?
I can obtain href links from a normal web page, but not from the jQuery object.
Upvotes: 1
Views: 134
Reputation: 9647
I did not debug your code, but from the response of the URL you provided, I can see that the HTML content is a value in a key-value pair of the returned object. So to make a good soup, you need to extract that HTML first. Using requests, you can do it as follows:
import re
import json
from bs4 import BeautifulSoup
import requests
url='http://ws.bursamalaysia.com/market/listed-companies/company-announcements/announcements_listing_f.html?_=1449326650932&callback=jQuery16208050466175191104_1449326525662&page_category=company&category=FA&sub_category=all&all_gm=&alphabetical=All&board=&sector=&date_from=&date_to=&company=5218&page=&testing='
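# The response looks like callback({...}); capture the object inside the parentheses
pat = re.compile(r'\(\s*(\{[\s\S]*\})\s*\)')
r = requests.get(url)
js_obj = json.loads(pat.search(r.text).group(1))
# The page markup is stored under the 'html' key of that object
soup = BeautifulSoup(js_obj.get('html'), 'lxml')
links = [a.get('href') for a in soup.find_all('a')]
for link in links:
    print(link)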
This will yield output like:
/market/listed-companies/list-of-companies/plc-profile.html?stock_code=5218
/market/listed-companies/company-announcements/4867445
/market/listed-companies/list-of-companies/plc-profile.html?stock_code=5218
/market/listed-companies/company-announcements/4772849
....
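To see the unwrapping step in isolation, without hitting the network, here is a minimal sketch. It assumes the response has the shape callback({"html": "..."}), as described above. SAMPLE_JSONP, unwrap_jsonp, LinkCollector and extract_links are hypothetical names made up for this illustration, and the sample string only imitates the real response. It also swaps in the standard library's html.parser for BeautifulSoup so that it runs with no extra packages. The same unwrap step works in front of BeautifulSoup too.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample imitating the assumed response shape:
# callbackName({"html": "<JSON-escaped markup>"})
SAMPLE_JSONP = (
    'jQuery123_456({"html": "<table>\\n<tr>'
    '<td><a href=\\"/market/listed-companies/list-of-companies/'
    'plc-profile.html?stock_code=5218\\">Company<\\/a><\\/td>'
    '<td><a href=\\"/market/listed-companies/company-announcements/4867445\\">'
    'Announcement<\\/a><\\/td><\\/tr>\\n<\\/table>"})'
)


def unwrap_jsonp(text):
    """Strip the callback wrapper and decode the JSON inside the parentheses."""
    start = text.index('(') + 1   # first character after the opening parenthesis
    end = text.rindex(')')        # position of the final closing parenthesis
    return json.loads(text[start:end])


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


def extract_links(jsonp_text):
    """Decode the JSONP payload, then parse the markup stored under 'html'."""
    payload = unwrap_jsonp(jsonp_text)
    collector = LinkCollector()
    collector.feed(payload['html'])
    return collector.links


if __name__ == '__main__':
    for link in extract_links(SAMPLE_JSONP):
        print(link)
```

Decoding the JSON first is what turns the escaped \n and \" sequences back into real newlines and quotes. Parsing the raw response directly leaves those escapes in the text, which is likely where the stray backslashes in the question's output came from.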
Upvotes: 1