BeautifulSoup get text between tags for one line

Question

I have a bunch of HTML reports generated by GCOV branch and line coverage tools. The files look like this:


224
✓✗✗✓
✗✓
329
        line of C++ code



225


   another line of  C++ code;

I would like to extract the text "(another) line of C++ code" and, ideally, the line number as well, so that the output looks like this:

224 line of C++ code
225 another line of C++ code

I tried to use BeautifulSoup, but it does not produce the output I want. My code looks like this:

from itertools import islice
import codecs
import glob
from ntpath import join
import os
from bs4 import BeautifulSoup

lineNo = ""
linetextCovered = ""
linetextNotCovered = ""
open('Output.txt', 'w').close() #Erase any content of Output.txt file

for filepath in glob.iglob('path/To/Reports/*.html'):
    with codecs.open(os.path.join(filepath), "r") as inputFile, open('Output.txt',"a") as outputFile:
        for num, line in enumerate(inputFile, 1):
            if lineNo in line:
                inputSoup = BeautifulSoup(line)
                text = inputSoup.getText()
                outputFile.write("".join(islice(text, 1) + "\t"))
            if linetextCovered or linetextNotCovered in line:
                inputSoup = BeautifulSoup(line)
                text = inputSoup.getText()
                outputFile.write("".join(islice(text, 4)))
            outputFile.write("\n")
print("Done")

But the output looks like this:
/* L
a:li
{

colo
text
}

What am I doing wrong?
Thank you very much for any help.

mama · Accepted Answer

You can do it like this:

from bs4 import BeautifulSoup

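# Sample rows. Only the <tr> rows and the 'lineno' / 'src' cell classes are
# relied on below; the other cells are assumed to be plain <td> elements.
html = '''
<table>
<tr>
<td class="lineno">224</td>
<td>✓✗✗✓
✗✓</td>
<td>329</td>
<td class="src">        line of C++ code</td>
</tr>
<tr>
<td class="lineno">225</td>
<td></td>
<td></td>
<td class="src">   another line of  C++ code;</td>
</tr>
</table>
'''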


# Each <tr> holds one source line: read the line-number cell and the source cell.
for tr in BeautifulSoup(html, 'html.parser').find_all('tr'):
    lineno = tr.find('td', {'class': 'lineno'}).text.strip()
    src = tr.find('td', {'class': 'src'}).text.strip()
    print(lineno, src)
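
As for what went wrong in the question's code: `lineNo`, `linetextCovered` and `linetextNotCovered` are all empty strings, and an empty string is contained in every string, so both `if` conditions are true for every line of the file (including the CSS at the top). `islice(text, 4)` then keeps only the first four characters of each line, which matches fragments such as `/* L` and `colo`. Parsing the whole file at once avoids this. Below is a minimal sketch of that idea applied to the report files, not a definitive implementation. The function names `extract_source_lines` and `write_report` are made up for illustration, the `lineno` and `src` class names are taken from the snippet above, and UTF-8 encoding of the reports is an assumption.

```python
import glob

from bs4 import BeautifulSoup


def extract_source_lines(html):
    """Return (line number, source text) pairs for rows that have both cells."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tr in soup.find_all("tr"):
        # class_ matches a cell even if it carries additional classes.
        lineno_cell = tr.find("td", class_="lineno")
        src_cell = tr.find("td", class_="src")
        if lineno_cell is None or src_cell is None:
            continue  # skip rows without a line number or source cell
        lineno = lineno_cell.get_text(strip=True)
        # Collapse runs of whitespace so the indentation is not carried over.
        src = " ".join(src_cell.get_text().split())
        pairs.append((lineno, src))
    return pairs


def write_report(pattern, output_path):
    """Parse every file matching `pattern` and write one 'lineno source' line per row."""
    # Assumption: the reports are UTF-8 encoded.
    with open(output_path, "w", encoding="utf-8") as out:
        for filepath in sorted(glob.glob(pattern)):
            with open(filepath, encoding="utf-8") as report:
                for lineno, src in extract_source_lines(report.read()):
                    out.write(f"{lineno} {src}\n")
```

Calling `write_report('path/To/Reports/*.html', 'Output.txt')` would then rebuild `Output.txt` from scratch on each run, so the separate step that empties the file first is no longer needed.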
