Reputation: 1446
I'm trying to build a web scraper. It must find all elements that match the chosen tags and save them, in the same order as in the original HTML, to a new file.md file.
Tags are specified in an array:
list_of_tags_you_want_to_scrape = ['h1', 'h2', 'h3', 'p', 'li']
This gives me only the content within the specified tag:
soup_each_html = BeautifulSoup(particular_page_content, "html.parser")
inner_content = soup_each_html.find("article", "container")
Let's say this is the result:
<article class="container">
<h1>this is headline 1</h1>
<p>this is paragraph</p>
<h2>this is headline 2</h2>
<a href="bla.html">this won't be shown bcs 'a' tag is not in the array</a>
</article>
Then I have a method responsible for writing a line to the file.md file if a tag from the array exists in the content:
with open("file.md", 'a+') as f:
    for tag in list_of_tags_you_want_to_scrape:
        inner_content_tag = inner_content.find_all(tag)
        for x in inner_content_tag:
            f.write(str(x))
            f.write("\n")
And it does. But the problem is that it iterates over the array, so it saves all <h1> elements first, all <h2> elements second, and so on, because that is the order specified in the list_of_tags_you_want_to_scrape array.
This would be the result:
<article class="container">
<h1>this is headline 1</h1>
<h2>this is headline 2</h2>
<p>this is paragraph</p>
</article>
So I would like to have them in the same order as the original HTML: after the first <h1> should come the <p> element.
That means I would probably need to loop over inner_content as well and check whether each element matches at least one of the tags from the array; if it does, save it and move on to the next one. I tried to write such a loop over inner_content to go through it element by element, but it gave me an error and I'm not sure it's the right approach (this is my first day using the BeautifulSoup module).
Any tips or advice on how to modify my method to achieve this? Thank you!
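A minimal sketch of that idea, using the sample article above: it turns out find_all also accepts a list of tag names and walks the tree in document order, so the per-tag loop can be replaced by a single pass that preserves the original ordering:

```python
from bs4 import BeautifulSoup

html = """
<article class="container">
<h1>this is headline 1</h1>
<p>this is paragraph</p>
<h2>this is headline 2</h2>
<a href="bla.html">this won't be shown bcs 'a' tag is not in the array</a>
</article>
"""

list_of_tags_you_want_to_scrape = ['h1', 'h2', 'h3', 'p', 'li']

inner_content = BeautifulSoup(html, "html.parser").find("article", "container")

# find_all with a list matches any of the given tags, traversing the
# tree in document order, so the original ordering is preserved
for x in inner_content.find_all(list_of_tags_you_want_to_scrape):
    print(x)
```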
Upvotes: 2
Views: 220
Reputation: 71471
To maintain the original order of the HTML input, you can use recursion to loop over the soup.contents attribute:
from bs4 import BeautifulSoup as soup

def parse(content, to_scrape=['h1', 'h2', 'h3', 'p', 'li']):
    if content.name in to_scrape:
        yield content
    for i in getattr(content, 'contents', []):
        yield from parse(i)
Example:
html = """
<html>
<body>
<h1>My website</h1>
<p>This is my first site</p>
<h2>See a listing of my interests below</h2>
<ul>
<li>programming</li>
<li>math</li>
<li>physics</li>
</ul>
<h3>Thanks for visiting!</h3>
</body>
</html>
"""
result = list(parse(soup(html, 'html.parser')))
Output:
[<h1>My website</h1>, <p>This is my first site</p>, <h2>See a listing of my interests below</h2>, <li>programming</li>, <li>math</li>, <li>physics</li>, <h3>Thanks for visiting!</h3>]
As you can see, the original order of the HTML is maintained, and the result can now be written to the file:
with open('file.md', 'w') as f:
    f.write('\n'.join(map(str, result)))
Each bs4 object contains a name and a contents attribute, among others. The name attribute is the tag name itself, while the contents attribute stores all the child HTML. parse uses a generator to first check if the passed bs4 object has a tag that belongs to the to_scrape list and, if so, yields that value. Lastly, parse iterates over the contents of content, and calls itself on each element.
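A quick illustration of those two attributes on a tiny document:

```python
from bs4 import BeautifulSoup

p = BeautifulSoup("<p>hi <b>there</b></p>", "html.parser").find("p")

print(p.name)      # the tag name itself: 'p'
print(p.contents)  # the direct children: ['hi ', <b>there</b>]
```

Note that plain strings in contents have a name of None, which is why the `content.name in to_scrape` check in parse safely skips them.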
Upvotes: 1