Abhi

Reputation: 1255

BeautifulSoup: read only specific href links

I want to read a Shakespeare play from the webpage below and collect the data into a dataframe for further analysis: http://shakespeare.mit.edu/cymbeline/index.html

I am using BeautifulSoup to follow the hyperlinks to each act's webpage, where I can collect the data. I am using the code below to collect the hyperlinks to each act as a list:

> play1 = "http://shakespeare.mit.edu/cymbeline/index.html"
> play = urlopen(play1).read()
> soup = BeautifulSoup(play, "lxml")
> tr_act = soup.find_all("a")
> for i in tr_act:
>     print(i.get('href'))

Because of the HTML page structure, the list also contains some additional items that I do not need:

/Shakespeare
http://www.amazon.com/gp/product/1903436028?ie=UTF8&tag=theinteclasar-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1903436028
full.html
cymbeline.1.1.html
cymbeline.1.2.html....

How can I programmatically skip the first 3 href elements in my scraper code? The HTML structure is so subtle that I cannot work out how to organize my code to do that.

> <p>You can buy the Arden text of this play from the Amazon.com online bookstore:
> <a href="https://rads.stackoverflow.com/amzn/click/com/1903436028" rel="nofollow noreferrer">Cymbeline: Second Series - Paperback (The Arden Shakespeare. Second Series)</a></p>
> <p><a href="full.html">Entire play</a> in one page</p>
> <p>
>      Act 1, Scene 1: <a href="cymbeline.1.1.html">Britain. The garden of Cymbeline's palace.</a><br>
>      Act 1, Scene 2: <a href="cymbeline.1.2.html">The same. A public place.</a><br>
>      Act 1, Scene 3: <a href="cymbeline.1.3.html">A room in Cymbeline's palace.</a><br>
>      Act 1, Scene 4: <a href="cymbeline.1.4.html">Rome. Philario's house.</a><br>

Upvotes: 0

Views: 511

Answers (1)

Tarun Gupta

Reputation: 504

Try the following, which keeps only the links whose href contains "cymbeline":

from bs4 import BeautifulSoup
import requests
import re

url = 'http://shakespeare.mit.edu/cymbeline/index.html'
html = requests.get(url, headers={
                        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)AppleWebKit/537.36 (KHTML, like Gecko)Chrome/51.0.2704.84 Safari/537.36"}).content

bsObj = BeautifulSoup(html, "lxml")

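# Keep only anchors whose href contains "cymbeline" (the scene pages)
links = bsObj.find_all('a', href=re.compile('cymbeline'))
finalLinks = []
for link in links:
    # hrefs are relative, so prefix the play's directory URL
    finalLinks.append('http://shakespeare.mit.edu/cymbeline/' + link.attrs['href'])

print(finalLinks)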
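
A slightly stricter variant of the same idea, as a rough sketch rather than a definitive solution: it anchors the regular expression so that only names shaped like cymbeline.<act>.<scene>.html match, and it resolves relative links with urljoin instead of string concatenation. The names scene_links, SCENE_HREF and SAMPLE_HTML are made up for illustration, the sample markup is a trimmed copy of the snippet in the question so the example runs without network access, and "html.parser" is used here in place of "lxml" only to avoid the extra dependency.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://shakespeare.mit.edu/cymbeline/index.html"

# Assumption: scene pages are named "cymbeline.<act>.<scene>.html",
# as in the hrefs quoted in the question.
SCENE_HREF = re.compile(r"^cymbeline\.\d+\.\d+\.html$")

# Trimmed copy of the markup from the question; the link text of the
# first anchor is a placeholder.
SAMPLE_HTML = """
<p><a href="/Shakespeare">home</a></p>
<p><a href="full.html">Entire play</a> in one page</p>
<p>
Act 1, Scene 1: <a href="cymbeline.1.1.html">Britain. The garden of Cymbeline's palace.</a><br>
Act 1, Scene 2: <a href="cymbeline.1.2.html">The same. A public place.</a><br>
</p>
"""


def scene_links(html, base_url=BASE_URL):
    """Return absolute URLs for the anchors whose href looks like a scene page."""
    soup = BeautifulSoup(html, "html.parser")
    # Passing a compiled regex as the href filter keeps only matching anchors.
    anchors = soup.find_all("a", href=SCENE_HREF)
    # urljoin resolves each relative href against the index page URL.
    return [urljoin(base_url, a["href"]) for a in anchors]


if __name__ == "__main__":
    for url in scene_links(SAMPLE_HTML):
        print(url)
```

To run it against the live page, pass the downloaded HTML (for example requests.get(BASE_URL).content) to scene_links instead of SAMPLE_HTML. Filtering by pattern rather than skipping the first 3 links means the result does not depend on how many unrelated links appear before the scene list.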

Upvotes: 1
