user10436512

Printing the content of all html files in a directory with BeautifulSoup

I opened a directory containing 200 HTML files using BeautifulSoup, but when I try to print the content of the whole directory with print(soup.prettify()), it shows the content of only one HTML file. The same happens with soup.find('title'): it only returns the title of that same HTML file. Can you tell me why? Python does not show any error, and I cannot work out what is wrong in my code.


import os
from bs4 import BeautifulSoup
import glob
import errno

dir_path = '/directory/path/to/folder/'
files = glob.glob(dir_path)
for name in files:
    try:
        with open(name) as f:
            soup = BeautifulSoup(f, "html.parser")
            print(type(soup))
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise

print(type(soup))
soup.find('title')

Upvotes: 1

Views: 971

Answers (2)

glhr

Reputation: 4537

The problem here is that you're passing a directory path to glob instead of a file path pattern (see the documentation for glob.glob()). Assuming you want to parse every HTML file in that directory, you can define the pattern as:

dir_path = '/directory/path/to/folder/*.html' 

Note the wildcard *: with it, the pattern in dir_path matches every file in the directory whose name ends in .html.
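
To see the difference between the two kinds of argument, here is a minimal sketch that runs glob on a throwaway directory. The temporary folder and the file names a.html, b.html and notes.txt are made up for illustration; they stand in for your real folder.

```python
import glob
import os
import tempfile

# Hypothetical throwaway folder standing in for '/directory/path/to/folder/'.
with tempfile.TemporaryDirectory() as folder:
    # Create three made-up files: two HTML files and one text file.
    for name in ("a.html", "b.html", "notes.txt"):
        with open(os.path.join(folder, name), "w") as f:
            f.write("<title>" + name + "</title>")

    # No wildcard: glob can only match the directory path itself.
    bare_matches = glob.glob(folder + os.sep)

    # With a wildcard: glob matches each file whose name ends in .html.
    html_matches = sorted(glob.glob(os.path.join(folder, "*.html")))
    html_names = [os.path.basename(p) for p in html_matches]

print(len(bare_matches))  # one entry: the directory, not the files inside it
print(html_names)
```

In this sketch the first call gives back a single entry (the folder), so a loop over it never reaches the individual files, while the second call gives one path per HTML file.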

Upvotes: 0

PARVATHY

Reputation: 122

The glob module finds all the pathnames matching a specified pattern (see the documentation). So pass dir_path as a pattern that matches all the HTML file names, using the wildcard character *. Try:

dir_path = '/directory/path/to/folder/*.html' 
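
One more thing that may be worth checking once the pattern is fixed: in the question's code, soup is reassigned on every pass through the loop, so anything done with soup after the loop only sees the last file that was parsed. A rough sketch of doing the work inside the loop instead is below. The helper name collect_titles, the utf-8 encoding and the demo files are assumptions for illustration, not part of the original code.

```python
import glob
import os
import tempfile

from bs4 import BeautifulSoup


def collect_titles(pattern):
    """Return {file path: title text or None} for every file matching pattern."""
    titles = {}
    for path in sorted(glob.glob(pattern)):
        # Assumes the files are UTF-8 encoded; adjust if yours are not.
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")
        # Use soup here, inside the loop, before the next file replaces it.
        title = soup.find("title")
        titles[path] = title.get_text(strip=True) if title is not None else None
    return titles


# Demo on a throwaway folder with two made-up HTML files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("one.html", "two.html"):
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write("<html><head><title>Page " + name + "</title></head></html>")

    titles = collect_titles(os.path.join(folder, "*.html"))
    found = sorted(titles.values())

print(found)
```

With your own data you would call something like collect_titles(dir_path), where dir_path is the '*.html' pattern above. The same idea applies to print(soup.prettify()): put it inside the loop if you want output for every file.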

Upvotes: 1
