mmm

Reputation: 1446

Tags in scraped content must keep the same order as in the original HTML file

I am trying to build a web scraper. It must find all elements that match the chosen tags and save them to a new file.md file in the same order as they appear in the original HTML.

Tags are specified in an array:

list_of_tags_you_want_to_scrape = ['h1', 'h2', 'h3', 'p', 'li']

Then this gives me only the content inside the specified tag:

soup_each_html = BeautifulSoup(particular_page_content, "html.parser")
inner_content = soup_each_html.find("article", "container")

Let's say this is the result:

<article class="container">
  <h1>this is headline 1</h1>
  <p>this is paragraph</p>
  <h2>this is headline 2</h2>
  <a href="bla.html">this won't be shown bcs 'a' tag is not in the array</a>
</article>

Then I have a method that writes a line to the file.md file if a tag from the array exists in the content:

with open("file.md", 'a+') as f:
    for tag in list_of_tags_you_want_to_scrape:
        inner_content_tag = inner_content.find_all(tag)

        for x in inner_content_tag:
            f.write(str(x))
            f.write("\n")

And it does. But the problem is that it loops over the array (for each tag) and saves all <h1> elements first, then all <h2> elements, and so on, because that is the order specified in the list_of_tags_you_want_to_scrape array.

This would be the result:

<article class="container">
  <h1>this is headline 1</h1>
  <h2>this is headline 2</h2>
  <p>this is paragraph</p>
</article>

So I would like to keep the order the original HTML has: after the first <h1> there should be the <p> element.

That probably means I need a for-each loop over inner_content as well, checking whether each element matches at least one of the tags from the array; if it does, save it and move on to the next element. I tried to iterate over inner_content element by element, but it gave me an error and I am not sure whether this is the correct approach (first day using the BeautifulSoup module).
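Roughly, the idea I am describing would look something like the sketch below (assuming inner_content is the <article> tag found above); as far as I understand, BeautifulSoup's .descendants attribute walks every nested element in document order, so filtering it by tag name should keep the original order:

from bs4 import Tag

with open("file.md", "a+") as f:
    # .descendants yields every nested element in document order;
    # keep only real tags whose name is in the list.
    for element in inner_content.descendants:
        if isinstance(element, Tag) and element.name in list_of_tags_you_want_to_scrape:
            f.write(str(element))
            f.write("\n")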

Any tips or advice on how to modify my method to achieve this, please? Thank you!

Upvotes: 2

Views: 220

Answers (1)

Ajax1234

Reputation: 71471

To maintain the original order of the HTML input, you can use recursion to walk the soup.contents attribute:

from bs4 import BeautifulSoup as soup

def parse(content, to_scrape=['h1', 'h2', 'h3', 'p', 'li']):
    # Yield the element itself if its tag name is one we want to keep.
    if content.name in to_scrape:
        yield content
    # Recurse into the children in document order; plain strings have no
    # contents attribute, so getattr falls back to an empty list for them.
    for i in getattr(content, 'contents', []):
        yield from parse(i, to_scrape)

Example:

html = """   
<html>
  <body>
      <h1>My website</h1>
      <p>This is my first site</p>
      <h2>See a listing of my interests below</h2>
      <ul>
         <li>programming</li>
         <li>math</li>
         <li>physics</li>
      </ul>
      <h3>Thanks for visiting!</h3>
  </body>
</html>
"""

result = list(parse(soup(html, 'html.parser')))

Output:

[<h1>My website</h1>, <p>This is my first site</p>, <h2>See a listing of my interests below</h2>, <li>programming</li>, <li>math</li>, <li>physics</li>, <h3>Thanks for visiting!</h3>]

As you can see, the original order of the HTML is maintained, and the result can now be written to the file:

with open('file.md', 'w') as f:
    f.write('\n'.join(map(str, result)))

Each bs4 Tag object has a name and a contents attribute, among others. The name attribute is the tag name itself, while the contents attribute stores the element's children (plain text nodes have no contents, which is why getattr is used with an empty-list fallback). parse is a generator: it first checks whether the passed bs4 object has a tag name that belongs to the to_scrape list and, if so, yields that element. Lastly, parse iterates over the contents of content and calls itself on each child.
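Applied to your own scraper, a minimal sketch might look like this (assuming inner_content is the <article class="container"> tag from your question):

# Hypothetical wiring back into the question's code: inner_content comes
# from soup_each_html.find("article", "container").
with open("file.md", "a+") as f:
    for element in parse(inner_content):
        f.write(str(element))
        f.write("\n")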

Upvotes: 1
