user1183556

Extract text from .html file, remove HTML, and write to text file using Python and Beautiful Soup

I'm using Beautiful Soup 4 to extract text from HTML files. With get_text() I can easily extract just the text, but when I try to write that text to a plain text file, all I get is the output "416". Here's the code I'm using:

from bs4 import BeautifulSoup
markup = open("example1.html")
soup = BeautifulSoup(markup)
f = open("example.txt", "w")
f.write(soup.get_text())

The console prints 416, but nothing gets written to the text file. Where have I gone wrong?

Upvotes: 2

Views: 7808

Answers (1)

danodonovan

Reputation: 20343

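The 416 is most likely just the return value of f.write() (the number of characters written), which the interactive interpreter echoes. Nothing shows up in the file because it is never closed, so the text is still sitting in the write buffer. Close both files when you are done. You can also pass the text to the BeautifulSoup class with markup.read(), although it accepts an open file handle as well: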

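from bs4 import BeautifulSoup

# Read and parse the HTML; naming the parser avoids bs4's "no parser specified" warning
markup = open("example1.html")
soup = BeautifulSoup(markup.read(), "html.parser")
markup.close()

# Write the extracted text; close() flushes the buffer to disk
f = open("example.txt", "w")
f.write(soup.get_text())
f.close()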

Or, in a more Pythonic style, use with blocks so both files are closed automatically:

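from bs4 import BeautifulSoup

# The with block closes the input file as soon as parsing is done
with open("example1.html") as markup:
    soup = BeautifulSoup(markup.read(), "html.parser")

# Leaving this with block closes the output file, which flushes the text to disk
with open("example.txt", "w") as f:
    f.write(soup.get_text())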

as @bernie suggested
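
To see where a number like 416 comes from, here is a minimal sketch that uses only the standard library. It assumes Python 3, where write() on a text file returns the number of characters written. The string and the scratch path are made up for the demonstration; the string stands in for soup.get_text().

```python
import os
import tempfile

# Stand-in for soup.get_text(); any short string behaves the same way.
text = "Hello, world!"

# Hypothetical scratch location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

f = open(path, "w", encoding="utf-8")
written = f.write(text)  # returns the number of characters written

# With default buffering, a short write usually has not reached the disk yet.
size_before_close = os.path.getsize(path)

f.close()  # flushes the buffered text to disk
size_after_close = os.path.getsize(path)

print(written, size_before_close, size_after_close)
```

In an interactive session the return value of f.write() is echoed to the console, which is probably what the 416 in the question is. The file only looks empty until it is closed or flushed.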

Upvotes: 5
