Reputation: 157
I have the following code, which I would like to use to extract the text between <font color='#FF0000'> and </font>. It works, but it only extracts the first unit, whereas I would like to extract every text unit between these tags. I tried to do this with a loop, but it didn't work.
import os
directory_path = 'C:\\My_folder\\tmp'
for files in os.listdir(directory_path):
    print(files)
    path_for_files = os.path.join(directory_path, files)
    text = open(path_for_files, mode='r', encoding='utf-8').read()
    starting_tag = '<font color='
    ending_tag = '</font>'
    ground = text[text.find(starting_tag):text.find(ending_tag)]
    results_dir = 'C:\\My_folder\\tmp'
    results_file = files[:-4] + 'txt'
    path_for_files = os.path.join(results_dir, results_file)
    open(path_for_files, mode='w', encoding='UTF-8').write(ground)
Upvotes: 0
Views: 339
Reputation: 174874
You could use Beautiful Soup's CSS selectors. select() returns every matching tag, so the loop below prints the text of each one.
>>> from bs4 import BeautifulSoup
>>> s = "foo <font color='#FF0000'> foobar </font> bar"
>>> soup = BeautifulSoup(s, 'lxml')
>>> for i in soup.select('font[color="#FF0000"]'):
...     print(i.text)
...
foobar
Upvotes: 2
Reputation: 10221
You can also use lxml.html:
>>> import lxml.html as PARSER
>>> s = "<html><body>foo <font color='#FF0000'> foobar </font> bar</body></html>"
>>> root = PARSER.fromstring(s)
>>> for i in root.iter("font"):
...     # keep only the red <font> tags and print their text
...     if i.get("color") == "#FF0000":
...         print(i.text_content())
...
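If installing a parser is not an option, the str.find approach from the question can probably be extended instead. str.find returns only the first match, which is why a single unit comes out; passing a start offset lets a loop walk through all of them. The sketch below is an illustration only: the function name extract_all is hypothetical, and it assumes the opening tag is always written exactly as <font color='#FF0000'> and that the tags are not nested.

```python
def extract_all(text, starting_tag="<font color='#FF0000'>", ending_tag="</font>"):
    """Return the text between every starting_tag/ending_tag pair, in order.

    Assumes the opening tag appears exactly as given and tags are not nested.
    """
    units = []
    pos = 0
    while True:
        # Search for the next opening tag, starting where the last match ended.
        start = text.find(starting_tag, pos)
        if start == -1:
            break
        start += len(starting_tag)  # skip past the opening tag itself
        end = text.find(ending_tag, start)
        if end == -1:
            break  # unclosed tag: stop rather than guess
        units.append(text[start:end].strip())
        pos = end + len(ending_tag)
    return units


sample = "a <font color='#FF0000'> one </font> b <font color='#FF0000'>two</font> c"
print(extract_all(sample))  # ['one', 'two']
```

Inside the loop over files from the question, something like '\n'.join(extract_all(text)) could then be written to the output file. For real-world HTML, where attribute quoting and spacing vary, a proper parser as in the answers above is likely the more robust choice.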
Upvotes: 0