mmm

Reputation: 1446

Tags in scraped content must keep the same order as in the original HTML file

I am trying to build a web scraper. It must find all elements that match the chosen tags and save them to a new file.md file in the same order as they appear in the original HTML.

Tags are specified in an array:

list_of_tags_you_want_to_scrape = ['h1', 'h2', 'h3', 'p', 'li']

Then this gives me only the content inside the specified tag:

soup_each_html = BeautifulSoup(particular_page_content, "html.parser")
inner_content = soup_each_html.find("article", "container")

Let's say this is the result:

<article class="container">
  <h1>this is headline 1</h1>
  <p>this is paragraph</p>
  <h2>this is headline 2</h2>
  <a href="bla.html">this won't be shown bcs 'a' tag is not in the array</a>
</article>

Then I have a method that writes a line to the file.md file if a tag from the array exists in the content:

with open("file.md", 'a+') as f:
    for tag in list_of_tags_you_want_to_scrape:
        inner_content_tag = inner_content.find_all(tag)

        for x in inner_content_tag:
            f.write(str(x))
            f.write("\n")

And it does. But the problem is that it loops over the array (for each tag) and saves all <h1> elements first, then all <h2> elements, and so on, because that is the order specified in the list_of_tags_you_want_to_scrape array.

This would be the result:

<article class="container">
  <h1>this is headline 1</h1>
  <h2>this is headline 2</h2>
  <p>this is paragraph</p>
</article>

So I would like to keep the order the original HTML has: after the first <h1> there should be the <p> element.

That probably means I need a for-each loop over inner_content as well, checking whether each element matches at least one of the tags from the array; if it does, save it and move on to the next element. I tried to iterate over inner_content element by element, but it gave me an error and I am not sure whether this is the correct approach (first day using the BeautifulSoup module).
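Roughly, the idea I am describing would look something like the sketch below (assuming inner_content is the <article> tag found above); as far as I understand, BeautifulSoup's .descendants attribute walks every nested element in document order, so filtering it by tag name should keep the original order:

from bs4 import Tag

with open("file.md", "a+") as f:
    # .descendants yields every nested element in document order;
    # keep only real tags whose name is in the list.
    for element in inner_content.descendants:
        if isinstance(element, Tag) and element.name in list_of_tags_you_want_to_scrape:
            f.write(str(element))
            f.write("\n")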

Any tips or advice on how to modify my method to achieve this, please? Thank you!

Upvotes: 2

Views: 220

Answers (1)

Ajax1234

Reputation: 71471

To maintain the original order of the HTML input, you can use recursion to walk the soup.contents attribute:

from bs4 import BeautifulSoup as soup

def parse(content, to_scrape=['h1', 'h2', 'h3', 'p', 'li']):
    # Yield the element itself if its tag name is one we want to keep.
    if content.name in to_scrape:
        yield content
    # Recurse into the children in document order; plain strings have no
    # contents attribute, so getattr falls back to an empty list for them.
    for i in getattr(content, 'contents', []):
        yield from parse(i, to_scrape)

Example:

html = """   
<html>
  <body>
      <h1>My website</h1>
      <p>This is my first site</p>
      <h2>See a listing of my interests below</h2>
      <ul>
         <li>programming</li>
         <li>math</li>
         <li>physics</li>
      </ul>
      <h3>Thanks for visiting!</h3>
  </body>
</html>
"""

result = list(parse(soup(html, 'html.parser')))

Output:

[<h1>My website</h1>, <p>This is my first site</p>, <h2>See a listing of my interests below</h2>, <li>programming</li>, <li>math</li>, <li>physics</li>, <h3>Thanks for visiting!</h3>]

As you can see, the original order of the HTML is maintained, and the result can now be written to the file:

with open('file.md', 'w') as f:
    f.write('\n'.join(map(str, result)))

Each bs4 Tag object has a name and a contents attribute, among others. The name attribute is the tag name itself, while the contents attribute stores the element's children (plain text nodes have no contents, which is why getattr is used with an empty-list fallback). parse is a generator: it first checks whether the passed bs4 object has a tag name that belongs to the to_scrape list and, if so, yields that element. Lastly, parse iterates over the contents of content and calls itself on each child.
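Applied to your own scraper, a minimal sketch might look like this (assuming inner_content is the <article class="container"> tag from your question):

# Hypothetical wiring back into the question's code: inner_content comes
# from soup_each_html.find("article", "container").
with open("file.md", "a+") as f:
    for element in parse(inner_content):
        f.write(str(element))
        f.write("\n")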

Upvotes: 1
