Iterate through multiple files and append text from HTML using Beautiful Soup

Question

I have a directory of downloaded HTML files (46 of them) and I am attempting to iterate through each of them, read their contents, strip the HTML, and append only the text into a text file. However, I'm unsure where I'm messing up, though, as nothing gets written to my text file?

import os
import glob
from bs4 import BeautifulSoup
path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
        markup = (path)
        soup = BeautifulSoup(markup)
        with open("example.txt", "a") as myfile:
                myfile.write(soup)
                f.close()

-----update---- I've updated my code as below, however the text file still doesn't get created.

import os
import glob
from bs4 import BeautifulSoup
path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()

-----update 2-----

Ah, I caught that I had my directory incorrect, so now I have:

import os
import glob
from bs4 import BeautifulSoup

path = "c:\users\me\downloads\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()

When this is executed, I get this error:

Traceback (most recent call last):
  File "C:\Users\Me\Downloads\bsoup.py, line 11 in 
    myfile.write(soup)
TypeError: must be str, not BeautifulSoup

I fixed this last error by changing

myfile.write(soup)

to

myfile.write(soup.get_text())

-----update 3 ----

It's working properly now, here's the working code:

import os
import glob
from bs4 import BeautifulSoup

path = "c:\users\me\downloads\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read())
    with open("example.txt", "a") as myfile:
        myfile.write(soup.get_text())
        myfile.close()

Moj · Accepted Answer

actually you are not reading html file, this should work,

soup=BeautifulSoup(open(webpage,'r').read(), 'lxml')

Iterate through multiple files and append text from HTML using Beautiful Soup

Answers (2)

Related Questions