Ethan

Reputation: 27

How can I 'uncomment' the contents of a comment with BeautifulSoup?

I'm using BeautifulSoup from bs4, version 4.10.0.

I'm scraping a site for a project that I'm developing, and I ran into a problem: some of the elements I scraped are wrapped in HTML comments for some reason.

<div class="h-[125] js-scroll-hidden" id="link-index-40">

<!--            <a href="/md5/dc0bbd5373a5ada24373640dab8defb3" class="custom-a flex items-center relative left-[-10] px-[10] py-2 hover:bg-[#00000011] ">
              <div class="flex-none">
                <div class="relative overflow-hidden w-[72] h-[108] flex flex-col justify-center">
                  <div class="absolute w-[100%] h-[90]" style="background-color: hsl(63deg 43% 73%)"></div>
                  <img class="relative inline-block" src="https://libgen.rs/covers/2274000/dc0bbd5373a5ada24373640dab8defb3-g.jpg" alt="" referrerpolicy="no-referrer" onerror="this.parentNode.removeChild(this)" loading="lazy" decoding="async"/>
                </div>
              </div>
              <div class="relative top-[-1] pl-4 grow overflow-hidden">
                <div class="truncate text-xs text-gray-500">English [en], pdf, 11.7MB, &#34;The Dale Carnegie course in effective spea - Dale Carnegie.pdf&#34;</div>
                <h3 class="truncate text-xl font-bold">The Dale Carnegie course in effective speaking, human relations and developing courage and confidence, improving your memory, leadership training : how the course is conducted and what you do at each session</h3>
                <div class="truncate text-sm">Dale Carnegie, 1989</div>
                <div class="truncate italic">Dale Carnegie &amp; Associates, Inc.</div>
              </div>
            </a>

--> </div>

I've been searching, but every answer I found was about removing the comments together with their contents, which is not my case.

I've tried different ways to remove the comment markers, but none of them were successful.

I've tried changing the content of the tag to match the tags that already have the desired format. It seemed fine at first, but it completely breaks .find() and .find_all(), which I need later.

I tried to find the comment symbols in the tag's contents so I could change them manually, but they didn't appear there. I did find a way to get the information, but it is too expensive for what I want to do: it requires converting the content of the tag that holds the information to a string and then parsing it again with BeautifulSoup. I need to do this for 200+ elements, and I need it to be relatively quick.

This is my desired result:

<div class="h-[125] js-scroll-hidden" id="link-index-40">

            <a href="/md5/dc0bbd5373a5ada24373640dab8defb3" class="custom-a flex items-center relative left-[-10] px-[10] py-2 hover:bg-[#00000011] ">
              <div class="flex-none">
                <div class="relative overflow-hidden w-[72] h-[108] flex flex-col justify-center">
                  <div class="absolute w-[100%] h-[90]" style="background-color: hsl(63deg 43% 73%)"></div>
                  <img class="relative inline-block" src="https://libgen.rs/covers/2274000/dc0bbd5373a5ada24373640dab8defb3-g.jpg" alt="" referrerpolicy="no-referrer" onerror="this.parentNode.removeChild(this)" loading="lazy" decoding="async"/>
                </div>
              </div>
              <div class="relative top-[-1] pl-4 grow overflow-hidden">
                <div class="truncate text-xs text-gray-500">English [en], pdf, 11.7MB, &#34;The Dale Carnegie course in effective spea - Dale Carnegie.pdf&#34;</div>
                <h3 class="truncate text-xl font-bold">The Dale Carnegie course in effective speaking, human relations and developing courage and confidence, improving your memory, leadership training : how the course is conducted and what you do at each session</h3>
                <div class="truncate text-sm">Dale Carnegie, 1989</div>
                <div class="truncate italic">Dale Carnegie &amp; Associates, Inc.</div>
              </div>
            </a>

</div>

I found this answer, "How can I find a comment with specified text string", but for my project it would be too resource intensive.

Is there a way to do this natively in BeautifulSoup, without changing data types or doing anything very resource intensive? (I'm willing to use another package if there is one that makes these situations easier to deal with.)

Upvotes: 0

Views: 57

Answers (1)

Michael Moreno

Reputation: 1349

Use .replace() on the raw HTML string to strip the comment markers:

html = '''
<body>
    <div>
        <!-- <div></div>
        <div></div>
        <div></div> -->
    </div>
</body>
'''

def remove_comments(html: str) -> str:
    # Strip only the comment delimiters; the markup inside them is kept.
    return html.replace('<!--', '').replace('-->', '')

remove_comments(html)

Result:

'''
<body>
    <div>
         <div></div>
        <div></div>
        <div></div> 
    </div>
</body>
'''
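
If the HTML has already been parsed, or only some comments should be uncommented, a possible alternative is to work on the comment nodes themselves. The sketch below is not part of the original answer: the helper name uncomment and the shortened sample markup are made up for illustration. It finds each bs4 Comment node, parses its text as a small fragment, and puts the resulting nodes in the comment's place, so .find() and .find_all() can see them afterwards. Each fragment is small, so this is probably fast enough for a few hundred elements, but that is an assumption worth measuring on real pages.

```python
from bs4 import BeautifulSoup, Comment

# Shortened, made-up sample in the shape of the markup from the question.
html = """
<div class="js-scroll-hidden" id="link-index-40">
<!-- <a href="/md5/abc" class="custom-a"><h3>Some title</h3></a> -->
</div>
"""


def uncomment(scope):
    """Replace each comment inside `scope` with the markup it contains.

    `scope` can be the whole soup or a single Tag; it is modified in place.
    (Hypothetical helper, not a BeautifulSoup API.)
    """
    # Collect the comment nodes first, then modify the tree.
    comments = scope.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # Parse only the comment's text, not the whole document again.
        fragment = BeautifulSoup(comment, "html.parser")
        # Inserting a BeautifulSoup object inserts its children one by one.
        comment.replace_with(fragment)
    return scope


soup = BeautifulSoup(html, "html.parser")
uncomment(soup.find(id="link-index-40"))  # or uncomment(soup) for every comment

link = soup.find("a", class_="custom-a")
print(link["href"])         # /md5/abc
print(link.h3.get_text())   # Some title
```

Passing a single tag instead of the whole soup limits the change to that element, which leaves genuine comments elsewhere in the page untouched.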

Upvotes: 1
