Dane P

Reputation: 33

Webscraping HTML with Python

Sorry if this is a repeat, but I've been looking through a lot of Stack Overflow questions on this and can't find a similar situation. I might be barking up the wrong tree here, but I'm new to programming, so even if someone could just set me on the right path it'd help immensely.

I'm trying to scrape data from a website that can only be accessed from inside our network, using Python 3.7 and BeautifulSoup 4. My first question is: is this a best-practice approach for a novice programmer, or should I be looking into something like JavaScript instead of Python?

My second question: the website's root HTML file has the attribute xmlns="http://www.w3.org/1999/xhtml" on its html tag. Does BeautifulSoup 4 work with XHTML?

I'll admit that I know nothing about web development, so even a few keywords or tips to start researching would put me on a more productive path. Right now my biggest problem is that I don't know what I don't know, and all the Python web-scraping examples I've found work on much simpler .html pages, whereas this site's page tree consists of multiple HTML/CSS/JPG and GIF files.

Thanks, -Dane

Upvotes: 2

Views: 722

Answers (1)

Jamie Lindsey

Reputation: 993

Python, requests and BeautifulSoup are definitely the way to go, especially for a beginner. BeautifulSoup works with all variations of HTML, XHTML, XML and so on.
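
For instance, here is a minimal sketch (the XHTML snippet is made up for illustration) showing that BeautifulSoup parses a document with that xmlns declaration just like plain HTML:

from bs4 import BeautifulSoup

xhtml = '''<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body><p>Hello from XHTML</p></body>
</html>'''

soup = BeautifulSoup(xhtml, 'html.parser')
print(soup.title.string)    # Example
print(soup.find('p').text)  # Hello from XHTML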

You will need to install Python and then install requests and bs4. Both are easy to do by reading the requests docs and the bs4 docs.
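
Assuming pip came with your Python install, getting both packages is usually just:

pip install requests
pip install beautifulsoup4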

I would suggest you learn a little of the basics of Python 3 if you don't know them already.

Here is a simple example that gets the title of the page you request:

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

url = 'http://some.local.domain/'

response = requests.get(url)
soup = bs(response.text, 'html.parser')

# let's get the title of the page
title = soup.title
print(title)

# let's get all the links in the page (resolving relative hrefs
# against the page URL so they can be requested directly)
hrefs = []
for link in soup.find_all('a'):
    href = link.get('href')
    print(href)
    if href:
        hrefs.append(urljoin(url, href))

link1 = hrefs[0]
link2 = hrefs[1]

# let's follow a link we find in the page (we'll go for the first);
# if it points at an image and we want to download it
response = requests.get(link1, stream=True)
if response.status_code == 200:
    with open(link1.split('/')[-1], 'wb') as f:
        for chunk in response:
            f.write(chunk)

# if the link is another web page
response = requests.get(link2)
soup = bs(response.text, 'html.parser')

# let's get the title of that page too
title = soup.title
print(title)

Go on the hunt for tutorials on requests and BeautifulSoup; there are tonnes of them... like this one

Upvotes: 1
