Reputation: 1255
I want to read a Shakespeare play from the webpage below and collect the data into a dataframe for further analysis: http://shakespeare.mit.edu/cymbeline/index.html
I am using BeautifulSoup to follow the hyperlinks to each act's webpage, where I can collect the data. I am using the code below to collect the hyperlinks to each act as a list:
> play1 = "http://shakespeare.mit.edu/cymbeline/index.html"
> play = urlopen(play1).read()
> soup = BeautifulSoup(play, "lxml")
> tr_act = soup.find_all("a")
> for i in tr_act:
>     print(i.get('href'))
Because of the HTML page structure, the list also contains some additional items that I do not need:
/Shakespeare
http://www.amazon.com/gp/product/1903436028?ie=UTF8&tag=theinteclasar-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1903436028
full.html
cymbeline.1.1.html
cymbeline.1.2.html....
How can I programmatically skip the first 3 href elements in my scraper code? The HTML structure is subtle enough that I cannot work out how to organize my code to do that:
> <p>You can buy the Arden text of this play from the Amazon.com online bookstore:
> <a href="https://rads.stackoverflow.com/amzn/click/com/1903436028" rel="nofollow noreferrer">Cymbeline: Second Series - Paperback (The Arden Shakespeare. Second Series)</a></p>
> <p><a href="full.html">Entire play</a> in one page</p>
> <p>
> Act 1, Scene 1: <a href="cymbeline.1.1.html">Britain. The garden of Cymbeline's palace.</a><br>
> Act 1, Scene 2: <a href="cymbeline.1.2.html">The same. A public place.</a><br>
> Act 1, Scene 3: <a href="cymbeline.1.3.html">A room in Cymbeline's palace.</a><br>
> Act 1, Scene 4: <a href="cymbeline.1.4.html">Rome. Philario's house.</a><br>
Upvotes: 0
Views: 511
Reputation: 504
Try the following:
from bs4 import BeautifulSoup
import requests
import re

url = 'http://shakespeare.mit.edu/cymbeline/index.html'
# Send a browser-like User-Agent header with the request.
html = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)AppleWebKit/537.36 (KHTML, like Gecko)Chrome/51.0.2704.84 Safari/537.36"}).content
bsObj = BeautifulSoup(html, "lxml")
# Keep only the anchors whose href contains "cymbeline" (the scene pages).
links = bsObj.find_all('a', href=re.compile('cymbeline'))
finalLinks = []
for link in links:
    # The hrefs are relative, so prepend the play's base URL.
    finalLinks.append('http://shakespeare.mit.edu/cymbeline/' + link.attrs['href'])
print(finalLinks)
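If you want to check the filtering without hitting the site, here is a rough sketch of the same idea that uses only the standard library and runs on the HTML excerpt from the question. The names `SCENE_HREF`, `SceneLinkParser`, `scene_links` and `SAMPLE_HTML` are made up for this illustration, and the stricter pattern assumes every scene page is named like `cymbeline.<act>.<scene>.html`, which I have only inferred from the links shown above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://shakespeare.mit.edu/cymbeline/index.html"

# Assumption: scene pages are named "cymbeline.<act>.<scene>.html".
SCENE_HREF = re.compile(r"^cymbeline\.\d+\.\d+\.html$")

# Excerpt of the index page, copied from the question.
SAMPLE_HTML = """
<p>You can buy the Arden text of this play from the Amazon.com online bookstore:
<a href="https://rads.stackoverflow.com/amzn/click/com/1903436028">Cymbeline</a></p>
<p><a href="full.html">Entire play</a> in one page</p>
<p>
Act 1, Scene 1: <a href="cymbeline.1.1.html">Britain. The garden of Cymbeline's palace.</a><br>
Act 1, Scene 2: <a href="cymbeline.1.2.html">The same. A public place.</a><br>
Act 1, Scene 3: <a href="cymbeline.1.3.html">A room in Cymbeline's palace.</a><br>
Act 1, Scene 4: <a href="cymbeline.1.4.html">Rome. Philario's house.</a><br>
"""


class SceneLinkParser(HTMLParser):
    """Collects the href of every <a> tag that matches SCENE_HREF."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and SCENE_HREF.match(href):
            self.hrefs.append(href)


def scene_links(html, base_url=BASE_URL):
    """Return absolute URLs for the scene pages linked from the given HTML."""
    parser = SceneLinkParser()
    parser.feed(html)
    # urljoin resolves each relative href against the index page URL.
    return [urljoin(base_url, href) for href in parser.hrefs]


if __name__ == "__main__":
    for link in scene_links(SAMPLE_HTML):
        print(link)
```

Matching on the shape of the href, rather than skipping the first 3 links by position, should keep working if the page gains or loses a link above the scene list. `urljoin` also avoids hard-coding the base path when building the absolute URLs.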
Upvotes: 1