Philip Crocker

Reputation: 117

Python: searching through html file grabbing <a> tags with the href and text content

I need a way to search through an HTML file with Python 3 and retrieve all of the <a> links on the page, then add each link's text to a dictionary alongside its href (URL).

This is what I've already tried.

import urllib3
import re

http = urllib3.PoolManager()
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL"
a = http.request("GET",my_url)
html = a.data

links = re.finditer(' href="?([^\s^"]+)', html)

for link in links:
  print(link)

I'm getting this error...

TypeError: can't use a string pattern on a bytes-like object

Thanks for your help.

I've also tried lxml...

links = lxml.html.parse("http://www.google.co.uk/?gws_rd=ssl#q=apple+stock&tbm=nws").xpath("//a/@href")
for link in links:
    print(link)

The result does not show all the links and I'm not sure why.

UPDATE:

New code =>

    def news_feed(self, stock):
        http = urllib3.PoolManager()
        my_url = "https://in.finance.yahoo.com/q/h?s=" + stock
        a = http.request("GET", my_url)
        html = a.data.decode('utf-8')
        xml = fromstring(html, HTMLParser())
        a_tags = xml.xpath("//table[@id='yfncsumtab']//a")
        self.paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
        pp(self.paired)

Upvotes: 1

Views: 1854

Answers (1)

Padraic Cunningham

Reputation: 180471

Use an HTML parser and decode the bytes as suggested. BeautifulSoup makes the job very easy and is a lot more reliable than a regex when parsing HTML:

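import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL"
a = http.request("GET", my_url)
# urllib3 returns the body as bytes; decode it to str before parsing
html = a.data.decode("utf-8")

# href=True keeps only the <a> tags that actually have an href attribute
print([a["href"] for a in BeautifulSoup(html).find_all("a", href=True)])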
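
As a quick illustration of why the decode step matters (a minimal sketch; the raw bytes below are a made-up stand-in for a.data, not the real page): a str regex pattern cannot be applied to bytes, which is the TypeError from the question. Decoding first lets the same pattern work:

```python
import re

# Hypothetical stand-in for `a.data`: urllib3 hands back the body as bytes.
raw = b'<a href="https://example.com/one">One</a> <a href="/two">Two</a>'

pattern = r'href="([^"\s]+)"'

error = None
try:
    re.findall(pattern, raw)  # str pattern applied to bytes input
except TypeError as exc:
    error = str(exc)

# Decoding to str first gives the str pattern something it can search.
links = re.findall(pattern, raw.decode("utf-8"))

print(error)
print(links)
```

This only shows the bytes/str mismatch; for real pages a parser is still the safer choice than a regex.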

If you only want the links starting with http, you can use a CSS selector:

soup = BeautifulSoup(html)

print([a["href"] for a in soup.select("a[href^=http]")])

Which will give you:

['https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://help.yahoo.com/l/in/yahoo/finance/', 'http://in.yahoo.com/bin/set?cmp=uheader&src=others', 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN', 'http://in.my.yahoo.com', 'https://in.yahoo.com/', 'https://in.finance.yahoo.com', 'https://in.finance.yahoo.com/investing/', 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy', 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html', 
'https://in.finance.yahoo.com/news/apple-sees-first-sales-dip-011402926.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-031840725.html', 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html', 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote', 'http://www.capitaliq.com', 'http://www.csidata.com', 'http://www.morningstar.com/']

To get the text and href:

soup = BeautifulSoup(html)

a_tags = soup.select("a[href^=http]")
from pprint import pprint as pp
paired = dict((a.text, a["href"]) for a in a_tags)

pp(paired)

Output:

 {u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 u'Capital IQ': 'http://www.capitaliq.com',
 u'Commodity Systems, Inc. (CSI)': 'http://www.csidata.com',
 u'Download the new Yahoo Mail app': 'https://in.mobile.yahoo.com/mail/',
 u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 u'Help': 'https://help.yahoo.com/l/in/yahoo/finance/',
 u'Mail': 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN',
 u'Markets': 'https://in.finance.yahoo.com/investing/',
 u'Morningstar, Inc.': 'http://www.morningstar.com/',
 u'My Yahoo': 'http://in.my.yahoo.com',
 u'New User? Register': 'https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL',
 u'Report an Issue': 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy',
 u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 u'Sign In': 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL',
 u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html',
 u'Yahoo': 'https://in.yahoo.com/',
 u'Yahoo India Finance': 'https://in.finance.yahoo.com',
 u'other exchanges': 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html',
 u'premium service.': 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote'}

The selector a[href^=http] matches every <a> tag that has an href attribute whose value starts with http.
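
If bs4 is not available, the same "href starts with http" filter can be approximated with the standard library's html.parser. This is only a sketch: the LinkCollector class and the sample markup are hypothetical, and it is far less forgiving of messy HTML than BeautifulSoup or lxml:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {link text: href} for <a> tags whose href starts with a prefix."""

    def __init__(self, prefix="http"):
        super().__init__()
        self.prefix = prefix
        self.paired = {}
        self._href = None  # href of the currently open matching <a>, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Same idea as a[href^=http]: has an href, and it starts with the prefix.
            if href and href.startswith(self.prefix):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Gather text only while inside a matching <a>, including nested tags.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.paired["".join(self._text).strip()] = self._href
            self._href = None


# Made-up markup: one absolute link, one relative link, one anchor without href.
sample = (
    '<a href="https://example.com/news/1">First <b>story</b></a>'
    '<a href="/q/h?s=AAPL">Older Headlines</a>'
    '<a name="top">No href</a>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
paired = collector.paired
print(paired)
```

Only the first link ends up in the dictionary, because the relative link and the anchor without an href do not match the prefix test.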

Using lxml and the table id to get just the story links, which are probably what you are most interested in:

from pprint import pprint as pp
from lxml.etree import fromstring, HTMLParser

# html is the decoded page source from the first snippet
xml = fromstring(html, HTMLParser())

a_tags = xml.xpath("//table[@id='yfncsumtab']//a")

paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
pp(paired)

Gives you:

{'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 "Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 "Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 "EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
 'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 "Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'}

We can do the same with select:

from pprint import pprint as pp

# html is the decoded page source from the first snippet
soup = BeautifulSoup(html)

a_tags = soup.select("#yfncsumtab a")
paired = dict((a.text, a["href"]) for a in a_tags)
pp(paired)

Which will match our lxml output:

{u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 u'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
 u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'}
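
Note that the 'Older Headlines' entry in the table output is a relative URL. If you need absolute URLs throughout, one option (a small sketch, assuming the page URL from the question is the right base) is urllib.parse.urljoin, which leaves already-absolute links untouched:

```python
from urllib.parse import urljoin

# Assumed base: the page the links were scraped from.
base = "https://in.finance.yahoo.com/q/h?s=AAPL"

# Two entries copied from the output above, one relative and one absolute.
paired = {
    "Older Headlines": "/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30",
    "Yahoo": "https://in.yahoo.com/",
}

# Resolve every href against the base URL.
absolute = {text: urljoin(base, href) for text, href in paired.items()}
print(absolute)
```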

You could just use //*[@id='yfncsumtab']//a, as ids should be unique.

To get the first six links from the table using an XPath, we can use the ul elements and take the first six with ul[position() < 7]:

a_tags = xml.xpath("//*[@id='yfncsumtab']//ul[position() < 7]//a")

paired = dict((a.xpath("./text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
from pprint import pprint as pp
pp(paired)

Which will give you:

{'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 "Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 "Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html'}

For small tables, you could also simply slice.
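
A minimal sketch of the slicing approach, using a hypothetical rows list in place of the real (text, href) pairs. With lxml the xpath call returns a plain list, so a_tags[:6] would work the same way before building the dict:

```python
# Hypothetical stand-in for the (text, href) pairs taken from the table, in page order.
rows = [("Story %d" % i, "https://example.com/news/%d" % i) for i in range(1, 11)]

# Slice first, then build the dict, so only the first six links are kept.
first_six = dict(rows[:6])
print(first_six)
```

Slicing the links takes the first six <a> tags, which is the same as the first six ul elements only if each ul holds exactly one link.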

Upvotes: 5
