casual

Reputation: 13

Unable to scrape href links from a jQuery object

Can I scrape a jQuery object? I want to get all the href links from the jQuery response. How could I achieve that? I learned Python web scraping less than a week ago from YouTube and the Internet.

url_1='http://ws.bursamalaysia.com/market/listed-companies/company-announcements/announcements_listing_f.html?_=1449326650932&callback=jQuery16208050466175191104_1449326525662&page_category=company&category=FA&sub_category=all&all_gm=&alphabetical=All&board=&sector=&date_from=&date_to=&company=5218&page=&testing='

import re
import urllib.request

from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; the original headers dict is not shown

#Standard url request
req = urllib.request.Request(url_1, headers=headers)
resp = urllib.request.urlopen(req)
respData = resp.read()

soup = BeautifulSoup(respData, 'html.parser')
#soup.prettify()

pattern = re.compile("href")
links = soup.find_all(text=pattern)
print(links)

I still cannot get all the links. It returns text containing many \n escape sequences instead of the links themselves. Why does this happen? Should I convert the result into a string?

I tried using

links = soup.find_all('a')
print(links)

but it returns an empty list []. Why is that?

I can obtain href links from a normal web page, but not from this jQuery object.

Upvotes: 1

Views: 134

Answers (1)

salmanwahed

Reputation: 9647

I did not debug your code, but from the response of the URL you provided I can see that the HTML content is a value in a key-value pair of a JSON object (the response is JSONP: JSON wrapped in a jQuery callback). So, to make a good soup, you need to extract that HTML first. Using requests, you can do it like the following:

import re
import json

from bs4 import BeautifulSoup
import requests


url='http://ws.bursamalaysia.com/market/listed-companies/company-announcements/announcements_listing_f.html?_=1449326650932&callback=jQuery16208050466175191104_1449326525662&page_category=company&category=FA&sub_category=all&all_gm=&alphabetical=All&board=&sector=&date_from=&date_to=&company=5218&page=&testing='
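# The response is JSONP: a JSON object wrapped in the jQuery callback, so capture the object between the parentheses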
pat = re.compile(r'\(\s*(\{[\s,\w,\W]*\})\s*\)')

r = requests.get(url)
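# Parse the captured JSON text; the page markup is stored under its 'html' key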
js_obj = json.loads(pat.search(r.text).group(1))

soup = BeautifulSoup(js_obj.get('html'), 'lxml')

links = map(lambda a: a.get('href'), soup.find_all('a'))

for link in links:
    print(link)

This will yield output like:

/market/listed-companies/list-of-companies/plc-profile.html?stock_code=5218
/market/listed-companies/company-announcements/4867445
/market/listed-companies/list-of-companies/plc-profile.html?stock_code=5218
/market/listed-companies/company-announcements/4772849
....
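
If you prefer to stay with urllib.request, as in your original snippet, the same idea works. This is only a rough sketch under the same assumptions as above (the endpoint still returns the jQuery-wrapped JSON object with the markup under an 'html' key); the User-Agent header is just a placeholder, since your original headers were not shown:

import re
import json
import urllib.request

from bs4 import BeautifulSoup

url = 'http://ws.bursamalaysia.com/market/listed-companies/company-announcements/announcements_listing_f.html?_=1449326650932&callback=jQuery16208050466175191104_1449326525662&page_category=company&category=FA&sub_category=all&all_gm=&alphabetical=All&board=&sector=&date_from=&date_to=&company=5218&page=&testing='
pat = re.compile(r'\(\s*(\{[\s\S]*\})\s*\)')

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder header, not from the original question

req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as resp:
    body = resp.read().decode('utf-8')  # decode assuming UTF-8

# Strip the jQuery callback wrapper and load the JSON payload
js_obj = json.loads(pat.search(body).group(1))

# The announcement markup lives under the 'html' key
soup = BeautifulSoup(js_obj.get('html'), 'html.parser')

for a in soup.find_all('a'):
    print(a.get('href'))

I used html.parser here so there is no extra dependency; lxml works the same way if you have it installed.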

Upvotes: 1
