I opened a directory containing 200 HTML files using BeautifulSoup, but when I try to print the content of the whole directory with print(soup.prettify()), it only shows the content of one HTML file. The same happens if I try soup.find('title'): it only loads the title of the same HTML file as before. Can you tell me why? Python does not show any error and I cannot understand what is wrong in my code.
import os
from bs4 import BeautifulSoup
import glob
import errno

dir_path = '/directory/path/to/folder/'
files = glob.glob(dir_path)
for name in files:
    try:
        with open(name) as f:
            soup = BeautifulSoup(f, "html.parser")
            print(type(soup))
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
print(type(soup))
soup.find('title')
Upvotes: 1
Views: 971
Reputation: 4537
The problem here is that you're passing a directory path to glob instead of a file path specification (see the documentation for glob.glob()). Assuming you want to parse every HTML file in the directory, you can define the path as:
dir_path = '/directory/path/to/folder/*.html'
Note the wildcard *, which means that dir_path will match every file in that directory whose name ends in .html.
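With the pattern fixed, there is a second thing to watch for: soup is rebound on every pass through the loop, so anything done with it after the loop only sees the last file that was parsed. A minimal sketch of one way to handle each file inside the loop follows. The helper name collect_titles and the UTF-8 encoding are assumptions for illustration, not part of the original code.

```python
import glob
import os

from bs4 import BeautifulSoup


def collect_titles(folder):
    """Return {file name: title text} for every .html file in `folder`.

    `collect_titles` is a hypothetical helper; the files are assumed
    to be UTF-8 encoded.
    """
    titles = {}
    pattern = os.path.join(folder, "*.html")
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")
        # Work with soup here, inside the loop: it is replaced on the
        # next iteration.
        title = soup.find("title")
        titles[os.path.basename(path)] = (
            title.get_text(strip=True) if title is not None else None
        )
    return titles
```

Calling collect_titles('/directory/path/to/folder/') would then give one entry per HTML file, with None for files that have no title tag.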
Upvotes: 0
Reputation: 122
The glob module finds all the pathnames matching a specified pattern (see documentation). So pass the dir_path argument as a pattern that matches all the HTML file names, using the wildcard character *. Try:
dir_path = '/directory/path/to/folder/*.html'
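To see the difference for yourself, here is a small sketch that compares the two calls on a throwaway folder. The temporary directory and the file names are made up for the demonstration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few empty files (names are illustrative only)
    for name in ("a.html", "b.html", "notes.txt"):
        open(os.path.join(folder, name), "w").close()

    # A bare directory path contains no wildcard, so glob returns
    # just that one path, not the files inside it.
    as_directory = glob.glob(folder + os.sep)

    # A pattern with * expands to every matching file.
    as_pattern = sorted(glob.glob(os.path.join(folder, "*.html")))

    print(as_directory)
    print([os.path.basename(p) for p in as_pattern])
```

The first list holds a single entry (the folder itself), while the second holds only the two .html files.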
Upvotes: 1