Repeat text extraction with Python

Question

I have the following code, which I would like to use to extract the text between <font color="#FF0000"> and </font>. It works, but it only extracts one unit (the first one), whereas I would like to extract all the text units between these tags. I tried to do this with a bash loop, but it didn't work.

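import os

# Raw string: in a plain literal, the '\t' in '\tmp' would be read as a tab.
directory_path = r'C:\My_folder\tmp'

for files in os.listdir(directory_path):
    print(files)
    path_for_files = os.path.join(directory_path, files)
    text = open(path_for_files, mode='r', encoding='utf-8').read()

    # Tag values assumed from the selector used in the accepted answer.
    starting_tag = '<font color="#FF0000">'
    ending_tag = '</font>'

    # find() returns the first match only, so a single unit is extracted.
    result = text[text.find(starting_tag):text.find(ending_tag)]

    results_dir = r'C:\My_folder\tmp'
    results_file = files[:-4] + 'txt'
    path_for_files = os.path.join(results_dir, results_file)
    open(path_for_files, mode='w', encoding='UTF-8').write(result)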

Avinash Raj · Accepted Answer

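You could use Beautiful Soup's CSS selectors. select() returns every matching element rather than only the first, so looping over its result gives you all the units.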

>>> from bs4 import BeautifulSoup
>>> s = 'foo <font color="#FF0000">foobar</font> bar'
>>> soup = BeautifulSoup(s, 'lxml')
>>> for i in soup.select('font[color="#FF0000"]'):
...     print(i.text)
...
foobar
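
To apply the same selector to every file in a folder, as in the question, something along these lines might work. This is only a sketch: the function names extract_units and extract_directory are made up for illustration, the selector is the one from the snippet above, and it assumes the built-in 'html.parser' backend is good enough for your files (swap in 'lxml' if you have it installed).

```python
import os

from bs4 import BeautifulSoup


def extract_units(html, selector='font[color="#FF0000"]'):
    """Return the text of every element matching the CSS selector."""
    # 'html.parser' ships with Python; 'lxml' is an optional faster backend.
    soup = BeautifulSoup(html, 'html.parser')
    return [tag.get_text() for tag in soup.select(selector)]


def extract_directory(source_dir, results_dir):
    """Write one .txt file per input file, one extracted unit per line."""
    for name in os.listdir(source_dir):
        source_path = os.path.join(source_dir, name)
        if not os.path.isfile(source_path):
            continue  # skip sub-directories
        with open(source_path, mode='r', encoding='utf-8') as handle:
            units = extract_units(handle.read())
        # splitext handles any extension length, unlike slicing with [:-4].
        stem, _ = os.path.splitext(name)
        result_path = os.path.join(results_dir, stem + '.txt')
        with open(result_path, mode='w', encoding='utf-8') as handle:
            handle.write('\n'.join(units))
```

You would call it as extract_directory(r'C:\My_folder\tmp', some_output_folder). Writing the results to a separate folder is the safer choice: if both arguments point at the same folder, an input file that already ends in .txt would be overwritten by its own result.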
