Michael Lin

Reputation: 105

beautifulsoup parse html file contents

I have 30,911 HTML files in a folder. For each file, I need to (1) check whether it contains this tag:

<strong>123</strong>

and (2) extract the content that follows it, up to the end of that section.

The problem is that in some files the section ends before

<strong>567</strong>

Other files do not have that tag, and the section ends before

<strong>89</strong> or some other tag (which I do not know, because I can't check 30K+ files).

The p elements also have a different "p p_number" class in each file, and sometimes have no id.

So far I use BeautifulSoup to find the tag, but I don't know how to extract the content that follows it:

import re
import bs4
soup = bs4.BeautifulSoup(fo, "lxml")  # fo: an open file object for one HTML file
m = soup.find("strong", string=re.compile("123"))

Also, is it possible to save the content as a .txt file while keeping the line breaks it has in the HTML?

line 1
line 2
...
line 50

If I use p.get_text(strip=True), everything runs together:

line1 content line2 content ... 
line50 content....

Upvotes: 1

Views: 1471

Answers (1)

alecxe

Reputation: 473803

If I understand you correctly, you can first find the starting point - a p element that has a strong element with "Question-and-Answer Session" text. Then, you can iterate over the p element's next siblings until you hit the one which has a strong element with "Copyright policy" text.

Complete reproducible example:

import re

from bs4 import BeautifulSoup


data = """
<body>
    <p class="p p4" id="question-answer-session">
      <strong>
       Question-and-Answer Session
      </strong>
    </p>

    <p class="p p4">
       Hi John and Greg, good afternoon. contents....
    </p>

    <p class="p p14">
      <strong>
       Copyright policy:
      </strong>
      other content about the policy....
    </p>
</body>
"""

soup = BeautifulSoup(data, "html.parser")

def find_question_answer(tag):
    return tag.name == 'p' and tag.find("strong", string=re.compile(r"Question-and-Answer Session"))

question_answer = soup.find(find_question_answer)
for p in question_answer.find_next_siblings("p"):
    if p.find("strong", string=re.compile(r"Copyright policy")):
        break

    print(p.get_text(strip=True))

Prints:

Hi John and Greg, good afternoon. contents....
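
If the closing heading varies from file to file, one possible variation is to stop at the first following p that contains any strong element, and to pass a separator to get_text() so the line breaks survive. This is only a sketch: extract_section and the sample markup are hypothetical, and the "any strong ends the section" rule is an assumption that will stop too early if your paragraphs use strong for inline emphasis such as speaker names.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: the section ends at a heading we did not know in advance.
html = """
<body>
  <p class="p p4"><strong>Question-and-Answer Session</strong></p>
  <p class="p p5">Operator<br/>Our first question comes from John.</p>
  <p>Hi John and Greg, good afternoon.</p>
  <p class="p p14"><strong>Copyright policy:</strong> other content</p>
</body>
"""

def extract_section(markup, start_text):
    """Return the text after the start heading, or None if it is absent."""
    soup = BeautifulSoup(markup, "html.parser")
    start = soup.find("strong", string=re.compile(start_text))
    if start is None:
        return None  # this file has no such section
    parent = start.find_parent("p")
    if parent is None:
        return None  # heading is not inside a p element
    lines = []
    for p in parent.find_next_siblings("p"):
        # Assumption: any later paragraph with a <strong> opens a new section.
        if p.find("strong"):
            break
        # A separator keeps text nodes on separate lines instead of merging them.
        lines.append(p.get_text("\n", strip=True))
    return "\n".join(lines)

section = extract_section(html, "Question-and-Answer Session")
print(section)
```

The returned string can be written to a .txt file with an ordinary open(path, "w", encoding="utf-8"), and a None result tells you which of the files lack the starting tag.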

Upvotes: 1
