maxcraft

Reputation: 23

Counting number of HTML tags through webscraping

My output should be the total count of each heading tag (H1 through H6) used on the page, plus the counts of paragraphs, images, and links.

I am getting incorrect numbers: the script does not find the H tags at all, and the counter outputs 1 for each header tag. How do I count the correct number of HTML tags?

import re
from bs4 import BeautifulSoup
import requests
from collections import Counter
from string import punctuation



#main program



    link_url = input("Please Enter the website address ")
#retrieves url for parsing
    r = requests.get(link_url)

    b_soup = BeautifulSoup(r.content, features="html.parser")

#Searching/parsing for various sized header content
    headerH1 = headH2 = headerH3 = headerH4 = headerH5 = headerH6 = 0

    for header_tags in b_soup.findAll():

        if(header_tags.name == "H1" or header_tags.name == "<H1>"):

         headerH1 = headerH1+1

    if(header_tags.name == "H2" or header_tags.name == "<H2 >"):

        headH2 = headH2+1

    if(header_tags.name == "H3" or header_tags.name == "<H3 >"):

        headerH3 = headerH3+1

    if(header_tags.name == "H4" or header_tags.name == "<H4 >"):

        headerH4 = headerH4+1

    if(header_tags.name == "H5" or header_tags.name == "<H5 >"):

        headerH5 = headerH5+1

    if(header_tags.name == "H6" or header_tags.name == "<H6 >"):

        headerH6 = headerH6+1

    print("Total Headings in H1: ", headerH1)

    print("Total Headings in H2: ", headH2)

    print("Total Headings in H3: ", headerH3)

    print("Total HeadingS in H4: ", headerH4)

    print("Total Headings in H4: ", headerH5)

    print("Total Headings in H5: ", headerH6)



    count = 0
#counting number of paragraphs
    for header_tags in b_soup.findAll():

        if(header_tags.name == 'p' or header_tags.name == '<p>'):

            count = count+1

    print("Paragraphs: ", count)


#counting image total
    for img in b_soup.findAll():

        if(img.name == 'img'):

            count = count+1

    print("Images: ", count)

    count = 0
#counting number of links
    for link in b_soup.find_all('a', href=True):

        count = count+1

    print("Links: ", count)


My output:


Total Headings in H1:  1
Total Headings in H2:  1
Total Headings in H3:  1
Total HeadingS in H4:  1
Total Headings in H4:  1
Total Headings in H5:  1
Paragraphs:  23
Images:  33
Links:  70

The correct output for the website I was using should actually be similar to:

Number of H1 Headings: 9

Number of images on this page: 10 

You don't need the website I used; you can test it with any link.

Upvotes: 2

Views: 1456

Answers (1)

Felix

Reputation: 332

Here is an example to count the number of <h1> tags from some HTML code:

from bs4 import BeautifulSoup
html = "<h1>first</h1><h1>second</h1><h2>third</h2>"
soup = BeautifulSoup(html, 'html.parser')
h1s = soup.find_all('h1')
h1_count = len(h1s) # Gets the number of <h1> tags

In this example h1_count would be 2.
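
One detail that likely explains why the comparisons in the question never match: as far as I know, html.parser lowercases tag names, so `tag.name` is "h1" even when the page writes `<H1>`, and it never includes the angle brackets. A small sketch to check this (the sample markup and variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up sample markup that uses upper-case tags on purpose.
sample_html = "<H1>first</H1><H1>second</H1><P>text</P>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all(True) returns every tag in the document.
tag_names = [tag.name for tag in sample_soup.find_all(True)]
print(tag_names)

# Comparing against the upper-case name finds nothing...
upper_matches = sum(1 for tag in sample_soup.find_all(True) if tag.name == "H1")
# ...while searching for the lower-case name finds both headings.
lower_matches = len(sample_soup.find_all("h1"))
print(upper_matches, lower_matches)
```

So comparing against "h1", or simply calling find_all('h1'), should behave as expected.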

You can do the same for other tag types by replacing h1 in find_all('h1'):

h2s = soup.find_all('h2')
h3s = soup.find_all('h3')
...
h2_count = len(h2s)
h3_count = len(h3s)
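
If you want all of the counts from the question in one go, one possible approach is to tally every tag name once and then read off the ones you need. This is only a sketch: `count_tags`, the "a[href]" key, and the sample markup are names I made up, not part of BeautifulSoup. Keeping each count under its own key also avoids reusing a single `count` variable between sections, which in the question's code is not reset before the image loop.

```python
from collections import Counter
from bs4 import BeautifulSoup

def count_tags(html):
    """Return counts for h1-h6, p, img, and links that have an href."""
    soup = BeautifulSoup(html, "html.parser")
    # Tally every tag by its (lower-case) name in a single pass.
    names = Counter(tag.name for tag in soup.find_all(True))
    # A Counter returns 0 for names that never occurred.
    counts = {f"h{level}": names[f"h{level}"] for level in range(1, 7)}
    counts["p"] = names["p"]
    counts["img"] = names["img"]
    # Only count anchors that actually carry an href, as in the question.
    counts["a[href]"] = len(soup.find_all("a", href=True))
    return counts

# Made-up sample markup for illustration.
sample = ('<h1>first</h1><h1>second</h1><h2>third</h2>'
          '<p>text <a href="/x">link</a> <a>no href</a></p><img src="a.png">')
result = count_tags(sample)
print(result)
```

For a live page you could pass `requests.get(link_url).content` to `count_tags` instead of the sample string.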

Hope this helps.

Upvotes: 4
