WhoDidYouSay

Reputation: 23

Python Web Scrape Index

I am VERY new to web scraping in any shape or form. I've been trying to get into Python, and I heard that web scraping was a good way to expose myself to it. After many Google searches I settled on two highly recommended modules: Requests and BeautifulSoup. I've read up a fair amount on both and have a basic understanding of how to use them.

I found a very basic website (basic in that there isn't much content, JavaScript and the like, which makes parsing the HTML a lot easier) and I have the following code:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://www.basicwebs.co.uk/contact.htm').text)

for row in soup('div',{'id': 'Layer1'})[0].h2('font'):
    tds = row.text
    print tds

This code works. It produces the following result:

BASIC
    WEBS
Contact details
Contact details

If you spend a few minutes inspecting the source of this page, this is the correct result (I assume). The code works, but what if I wanted to get a different part of the page, such as the short paragraph that states "If you are interested in having a website designed and hosted by us, please contact us either by e-mail or telephone."? My understanding was that I could simply change the index number to match the header this text is found under, but when I change it I get an error saying that the list index is out of range.

Can anybody help? (as simple as you can make it, if possible)

I'm using Python 2.7.8

Upvotes: 0

Views: 2180

Answers (3)

M Ramzan

Reputation: 1

from urllib.request import urlopen
from bs4 import BeautifulSoup

web_address = 'http://www.basicwebs.co.uk/contact.htm'
html = urlopen(web_address)
bs = BeautifulSoup(html.read(), 'html.parser')

contact_info = bs.findAll('h2', {'align':'left'})[0]
for info in contact_info:
    print(info.get_text())

Upvotes: 0

avenet

Reputation: 3043

The text you require is surrounded by a font tag with the attribute size=3, so one way to get it is to select the first occurrence of that tag, like this:

font_elements = soup('font', {'size': 3})

if font_elements:
    print font_elements[0].text

RESULT:

If you are interested in having a website designed and hosted by us, please contact us either by e-mail or telephone.

Upvotes: 1

Anshul Sharma

Reputation: 347

You can do this directly:

soup('font',{'size': '3'})[0].text

However, I want to draw your attention to the mistake you made earlier.

soup('div',{'id': 'Layer1'})

This returns a list of every div element with id='Layer1', since a page may contain more than one match. The HTML you were trying to parse has only one such element, so 0 is the only valid index, and any other index is out of range.
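To make that concrete, here is a minimal sketch. The SAMPLE_HTML string is a made-up stand-in for the page (its real markup may differ), and safe_get is a hypothetical helper, not part of BeautifulSoup. It only illustrates that the call returns a list whose length depends on how many elements matched:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: a single div with id="Layer1".
SAMPLE_HTML = """
<div id="Layer1">
  <h2><font>Contact details</font></h2>
  <p><font size="3">If you are interested, please contact us.</font></p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# soup(...) is shorthand for soup.find_all(...), which always returns a list.
layers = soup('div', {'id': 'Layer1'})
print(len(layers))  # number of matching divs; valid indexes are 0..len-1


def safe_get(matches, index):
    """Return matches[index], or None when the index is out of range."""
    if -len(matches) <= index < len(matches):
        return matches[index]
    return None


first = safe_get(layers, 0)    # the only matching div
second = safe_get(layers, 1)   # None here, where layers[1] would raise IndexError

# Search inside the div that was found, instead of changing the list index.
paragraph = first.find('font', {'size': '3'})
print(paragraph.text)
```

The point is that the index selects among matching elements, not among sections of the page. To reach other text, change what you search for rather than the index.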

You could use an interactive Python interpreter such as bpython or IPython to inspect what an object actually contains. Happy Hacking!!!

Upvotes: 1
