BeautifulSoup: read specific href

Question

I want to read a Shakespeare play from the webpage below and collect the data into a DataFrame for further analysis: http://shakespeare.mit.edu/cymbeline/index.html

I am using BeautifulSoup to follow the hyperlinks that lead to each act's webpage, where I can collect the data. I am using the code below to collect the hyperlinks to each act as a list:

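> from urllib.request import urlopen
> from bs4 import BeautifulSoup
> play1 = "http://shakespeare.mit.edu/cymbeline/index.html"
> play = urlopen(play1).read()
> soup = BeautifulSoup(play, "lxml")
> tr_act = soup.find_all("a")  # every anchor tag on the index page
> for i in tr_act:
>     print(i.get('href'))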

Because of the HTML page structure, the list also contains some additional items that I do not need:

/Shakespeare
http://www.amazon.com/gp/product/1903436028?ie=UTF8&tag=theinteclasar-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1903436028
full.html
cymbeline.1.1.html
cymbeline.1.2.html....

How can I programmatically skip the first 3 href elements in my scraper code? The HTML structure is subtle enough that I cannot work out how to organize my code to do that.

> You can buy the Arden text of this play from the Amazon.com online bookstore: Cymbeline: Second Series - Paperback (The Arden Shakespeare. Second Series)

> Entire play in one page

> Act 1, Scene 1: Britain. The garden of Cymbeline's palace.

> Act 1, Scene 2: The same. A public place.

> Act 1, Scene 3: A room in Cymbeline's palace.

> Act 1, Scene 4: Rome. Philario's house.
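
One possible way to skip the unwanted links is to filter by pattern rather than by position. The following is only a minimal sketch, assuming every scene page is named like `cymbeline.1.1.html` (play, act, scene), as the output above suggests. The names `SCENE_HREF` and `scene_urls` are hypothetical, and the sample list stands in for the hrefs the scraper would collect.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: "<play>.<act>.<scene>.html", e.g. "cymbeline.1.1.html".
SCENE_HREF = re.compile(r"^\w+\.\d+\.\d+\.html$")


def scene_urls(hrefs, base_url):
    """Keep only hrefs that look like scene pages and make them absolute."""
    return [
        urljoin(base_url, href)
        for href in hrefs
        if href and SCENE_HREF.match(href)  # skips None and non-scene links
    ]


base = "http://shakespeare.mit.edu/cymbeline/index.html"

# Sample data; in the scraper this would be [a.get("href") for a in tr_act].
sample = [
    "/Shakespeare",
    "http://www.amazon.com/gp/product/1903436028?ie=UTF8",
    "full.html",
    "cymbeline.1.1.html",
    "cymbeline.1.2.html",
    None,  # anchors without an href return None from .get("href")
]

links = scene_urls(sample, base)
for link in links:
    print(link)
```

Matching on the file name should keep working if the page gains or loses links above the scene list, which slicing off the first three items would not. BeautifulSoup's `find_all` also accepts a compiled regular expression as an attribute filter, so `soup.find_all("a", href=SCENE_HREF)` may be an alternative way to apply the same pattern directly.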
