Reputation: 3
I have a text file of ~500k lines with fairly random HTML syntax. The rough structure of the file is as follows:
content <title> title1 </title> more words
title contents2 title more words <body> <title> title2 </title>
<body><title>title3</title></body>
I want to extract everything between the <title> and </title> tags. The expected output is:
title1
title2
title3
This is what I have tried so far:
content_list = []
with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', errors = 'ignore') as openfile2:
    for line in openfile2:
        for item in line.split("<title>"):
            if "</title>" in item:
                content = (item [ item.find("<title>")+len("<title>") : ])
                content_list.append(content)
But this method does not retrieve all the titles. I think this could be because some tags are attached to other tags or words without spaces, e.g. <body><title>.
I've considered replacing every '<' and '>' with a space and applying the same method, but if I did that, I would also get "contents2" as output.
Upvotes: 0
Views: 1702
Reputation: 51
Try running:
from bs4 import BeautifulSoup

# Open the file in a with-block so it is closed once parsing is done
with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

content_list = []
contents = soup.find_all('title')
for content in contents:  # loop over the list returned by find_all
    title = content.get_text().strip()
    print(title)
    content_list.append(title)
Upvotes: 0
Reputation: 1146
An example that keeps the structure of your code:
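content_list = []  # must exist before anything is appended to it
with open('file.txt', 'r') as file:
    for line in file:
        # Split on the opening tag; pieces that contain the closing tag hold a title
        for item in line.split('<title>'):
            if '</title>' in item:
                # Keep only the text before the closing tag, without surrounding whitespace
                content_list.append(item.split('</title>')[0].strip())
print(content_list)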
That said, I think BeautifulSoup is the better option anyway.
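The line-by-line split only finds a title when its opening and closing tags sit on the same line. If that is not guaranteed, one possible variation is a regular expression run over the whole text. This is a different technique from the split approach, shown only as a rough sketch; the sample text is copied from the question, and the name TITLE_RE is made up for illustration.

```python
import re

# Sample text copied from the question. For the real file, read its whole
# contents into `text` instead of using this literal.
text = """content <title> title1 </title> more words
title contents2 title more words <body> <title> title2 </title>
<body><title>title3</title></body>"""

# Non-greedy match between the tags; DOTALL lets a title span several lines,
# IGNORECASE also accepts <TITLE>. TITLE_RE is an illustrative name.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

content_list = [match.strip() for match in TITLE_RE.findall(text)]
print(content_list)  # ['title1', 'title2', 'title3']
```

The plain words "title contents2 title" are skipped because the pattern requires the angle brackets. A regex like this assumes the title tags carry no attributes and are never nested, which may not hold for messier HTML.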
Upvotes: 0
Reputation: 325
I believe you could do this with BeautifulSoup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('file_to_read.txt', 'r'), 'html.parser')
print(soup.find_all('title'))
# [<title> title1 </title>, <title> title2 </title>, <title>title3</title>]
print(soup.find_all('title')[0].text)
# ' title1 '
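If installing bs4 is not an option, something similar can probably be done with the standard library's html.parser module. The following is only a sketch under that assumption, not a replacement for BeautifulSoup; the class name TitleCollector and the inline sample text (copied from the question) are illustrative.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the stripped text found inside each <title>...</title> pair."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None while the parser is outside a <title> element

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._buffer = []  # start collecting text for a new title

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


# Sample text copied from the question; for the real file, feed its contents instead.
sample = """content <title> title1 </title> more words
title contents2 title more words <body> <title> title2 </title>
<body><title>title3</title></body>"""

parser = TitleCollector()
parser.feed(sample)
parser.close()
print(parser.titles)
```

Like BeautifulSoup with 'html.parser', this does not care whether tags are separated by spaces, so <body><title> is handled the same way as <body> <title>.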
Upvotes: 1