rbutrnz

Reputation: 393

How to grab data from HTML files using Python

I have a snippet that extracts links from my HTML file, and I want to add some more data to the output. I have searched for a way to do this, but without success.

Any ideas would be very welcome. Thank you.

from bs4 import BeautifulSoup
import re

srcfile = 'sourcefile.html'

# Open the local file in a context manager so it is closed after parsing.
with open(srcfile, 'r', encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Print every link that points at a BscScan token page.
for a_href in soup.find_all("a", href=re.compile(r'https://bscscan\.com/token/')):
    print("BscScan:", a_href["href"])

Current Output:

BscScan: https://bscscan.com/token/0xd7b0B9d1F011ec19312836F09Ef24a6494da0B8F
BscScan: https://bscscan.com/token/0x6679777D2D59B80302164284a9494a2080350225

Desired Output With Additional Data:

Name: BigDustin                         Total Supply: 10,000,000,000 BIGD
Liquidity: 5.0000 BNB ($2909.3682)      Holders: 1          Transfers: 1
  BscScan: https://bscscan.com/token/0xd7b0B9d1F011ec19312836F09Ef24a6494da0B8F

Name: SUPERHIT SHIBA                    Total Supply: 100,000,000 SUPERHIT
Liquidity: 1.0100 BNB ($587.6924)       Holders: 2          Transfers: 2
  BscScan: https://bscscan.com/token/0x6679777D2D59B80302164284a9494a2080350225

sourcefile.html #-- local .html file

🆕 <u>New token</u><br><br><strong>Version</strong>: V2<br><br><strong>Pair</strong>: WBNB-BIGD<br><strong>Liquidity</strong>: 5.0000 BNB ($2909.3682)<br>ℹ️ <a href="https://bscscan.com/tx/0xba7ed738e744e5899138529a3051ee3b9d2bdc9512ffb8e649d9c291dfe26b14">Transaction</a><br><br><strong>Name</strong>: BigDustin<br><strong>Total Supply</strong>: 10,000,000,000 <strong>BIGD</strong><br><strong>Token Price</strong>: 0.0000 BNB ($0.0000)<br><br><strong>Holders</strong>: 1<br><strong>Transfers</strong>: 1<br><br>⛓ <a href="https://bscscan.com/token/0xd7b0B9d1F011ec19312836F09Ef24a6494da0B8F">BscScan</a><br><br>🥞 <a href="https://exchange.pancakeswap.finance/#/swap?outputCurrency=0xd7b0B9d1F011ec19312836F09Ef24a6494da0B8F">Swap on PancakeSwap</a><br><br>➡️ <a href="https://poocoin.app/tokens/0xd7b0B9d1F011ec19312836F09Ef24a6494da0B8F">poocoin.app</a><br><br>0xd7b0B9d1F011ec19312836F09Ef24a6494da0B8F<br>-----------------------------------<br>Our Main Info Channel - <a href="https://t.me/YourCryptoHelper">YourCryptoHelper</a>
       </div>
      </div>
     </div>
     <div class="message default clearfix joined" id="message473415">
      <div class="body">
       <div class="pull_right date details" title="20.11.2021 03:44:47">03:44
       </div>
       <div class="text">
🆕 <u>New token</u><br><br><strong>Version</strong>: V2<br><br><strong>Pair</strong>: SUPERHIT -WBNB<br><strong>Liquidity</strong>: 1.0100 BNB ($587.6924)<br>ℹ️ <a href="https://bscscan.com/tx/0xcbddd72c16dafd622cb8ba815f68c5139b2d080943a544dfd2eb7f7f1aea86de">Transaction</a><br><br><strong>Name</strong>: SUPERHIT SHIBA<br><strong>Total Supply</strong>: 100,000,000 <strong>SUPERHIT </strong><br><strong>Token Price</strong>: 0.0000 BNB ($0.0000)<br><br><strong>Holders</strong>: 2<br><strong>Transfers</strong>: 2<br><br>⛓ <a href="https://bscscan.com/token/0x6679777D2D59B80302164284a9494a2080350225">BscScan</a><br><br>🥞 <a href="https://exchange.pancakeswap.finance/#/swap?outputCurrency=0x6679777D2D59B80302164284a9494a2080350225">Swap on PancakeSwap</a><br><br>➡️ <a href="https://poocoin.app/tokens/0x6679777D2D59B80302164284a9494a2080350225">poocoin.app</a><br><br>0x6679777D2D59B80302164284a9494a2080350225<br>-----------------------------------<br>Our Main Info Channel - <a href="https://t.me/YourCryptoHelper">YourCryptoHelper</a>
       </div>
      </div>
     </div>

Upvotes: 2

Views: 1092

Answers (1)

Rusticus

Reputation: 382

How about starting from each "New token" tag and following the chain of sibling nodes with "nextSibling" (spelled next_sibling in current Beautiful Soup)? For example:

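for u in soup.select('u'):
    s = u.next_sibling  # next_sibling is the current name for the legacy nextSibling alias
    # Walk the siblings until the next <u> marker or the end of the parent.
    while s is not None and s.name != 'u':
        if s.name == 'strong':
            key = s.text.strip()
            s = s.next_sibling  # the node right after <strong> holds ": value"
            value = s.text.strip() if s is not None else ""
            print(key, value)
        if s is not None:
            s = s.next_sibling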

Result:

Version : V2
Pair : WBNB-BIGD
Liquidity : 5.0000 BNB ($2909.3682)
Name : BigDustin
Total Supply : 10,000,000,000
BIGD
Token Price : 0.0000 BNB ($0.0000)
Holders : 1
Transfers : 1
Version : V2
Pair : SUPERHIT -WBNB
Liquidity : 1.0100 BNB ($587.6924)
Name : SUPERHIT SHIBA
Total Supply : 100,000,000
SUPERHIT
Token Price : 0.0000 BNB ($0.0000)
Holders : 2
Transfers : 2
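
To get from these key/value pairs to the layout asked for in the question, one option is to collect the pairs for each token in a dict and format them afterwards. What follows is only a minimal sketch under some assumptions: parse_tokens, format_token, SAMPLE_HTML and the column widths (40 and 20) are names and values I made up for illustration, and the sample is a shortened copy of one message from the question, not the full file. It also joins the bold ticker that follows Total Supply back onto its value.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Shortened copy of one message from the question (assumption: the real
# messages follow the same "<strong>Key</strong>: value<br>" pattern).
SAMPLE_HTML = (
    '<div class="text">'
    '<u>New token</u><br><br>'
    '<strong>Liquidity</strong>: 5.0000 BNB ($2909.3682)<br>'
    '<strong>Name</strong>: BigDustin<br>'
    '<strong>Total Supply</strong>: 10,000,000,000 <strong>BIGD</strong><br>'
    '<strong>Holders</strong>: 1<br><strong>Transfers</strong>: 1<br><br>'
    '<a href="https://bscscan.com/token/'
    '0xd7b0B9d1F011ec19312836F09Ef24a6494da0B8F">BscScan</a>'
    '</div>'
)

TOKEN_LINK = re.compile(r"https://bscscan\.com/token/")


def parse_tokens(soup):
    """Return one dict of fields per <u> marker (hypothetical helper)."""
    tokens = []
    for marker in soup.find_all("u"):
        fields = {}
        key = None  # label of the most recent <strong>, reset at each <br>
        for node in marker.next_siblings:
            if isinstance(node, NavigableString):
                text = node.strip()
                # A text node such as ": BigDustin" is the value for the label.
                if key is not None and text.startswith(":"):
                    fields[key] = text[1:].strip()
            elif node.name == "u":
                break  # the next token starts here
            elif node.name == "br":
                key = None
            elif node.name == "strong":
                label = node.get_text(strip=True)
                if key in fields:
                    # Bold text after a value, e.g. the ticker after Total Supply.
                    fields[key] += " " + label
                else:
                    key = label
            elif node.name == "a" and TOKEN_LINK.match(node.get("href", "")):
                fields["BscScan"] = node["href"]
        tokens.append(fields)
    return tokens


def format_token(fields):
    """Lay one token out in the three-line form shown in the question."""
    def item(key):
        # "?" marks a field that was not found in the message.
        return f"{key}: {fields.get(key, '?')}"

    return "\n".join([
        f"{item('Name'):<40}{item('Total Supply')}",
        f"{item('Liquidity'):<40}{item('Holders'):<20}{item('Transfers')}",
        f"  {item('BscScan')}",
    ])


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for token in parse_tokens(soup):
    print(format_token(token))
    print()
```

For the real file, the soup would be built from sourcefile.html as in the question instead of from SAMPLE_HTML. Keeping the parsing and the formatting in separate functions means the column widths can be changed without touching the sibling walk.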

Upvotes: 1
