wumbo__0

Reputation: 5

How can I extract text from a span tag that doesn't have a title?

I'm trying to extract the CVE from this page and a few others: https://www.tenable.com/plugins/nessus/19090. However, the span holding the CVE doesn't have a title or any other attribute I can use to grab its text. Is there a way to do this? Here is the HTML around the CVE:

<section>
    <h4 class="u-m-t-2">Reference Information</h4>
    <section>
        <p><strong>CVE
                <!-- -->:
            </strong><span><a href="/cve/CVE-2004-0804">CVE-2004-0804</a></span></p>
    </section>
    <section></section>
    <div>
        <section>
            <p><strong>CERT
                    <!-- -->:
                </strong><span><a target="_blank" rel="noopener noreferrer" href="https://www.kb.cert.org/vuls/id/555304">555304</a></span></p>
        </section>
    </div>
</section>

EDIT: Here is my current code with Jack Ashton's suggestion.

import bs4 as bs
from urllib.request import urlopen, Request
import urllib
import sys
import re

with open("path to file with id's") as f:
    for line in f:
        active = line.strip()  # drop the trailing newline before building the URL
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
        reg_url = "https://www.tenable.com/plugins/nessus/" + str(active) 
        req = Request(url=reg_url, headers=headers) 
        try:
            source = urlopen(req).read()
        except urllib.error.HTTPError as e:
            if e.getcode() == 404: # check the return code  
                continue
            if e.getcode() == 502:  
                continue        
            raise

        soup = bs.BeautifulSoup(source,'lxml')
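        # re.search needs a string, so convert the parsed document back to text
        result = re.search(r"<span>(.*CVE.*)</span>", str(soup))
        if result:  # pages without a CVE produce no match
            print(result[0])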

Upvotes: 0

Views: 86

Answers (3)

Andrej Kesely

Reputation: 195573

import requests
from bs4 import BeautifulSoup

url = 'https://www.tenable.com/plugins/nessus/19090'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

print( soup.select_one('a[href*="/cve/CVE"]').text )

Prints:

CVE-2004-0804

Or:

print( soup.select_one('strong:contains("CVE:") + span').text )

Or:

print( soup.select_one('h4:contains("Reference Information") + * span').text )
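
Because the question loops over many plugin pages, some of them may list several CVEs or none at all. When nothing matches, `select_one(...)` returns `None` and `.text` raises `AttributeError`. The following is a minimal sketch of a list-based variant, assuming the CVE links keep the `/cve/` href prefix shown in the question; `extract_cves` and `SAMPLE_HTML` are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question, so the example runs offline.
SAMPLE_HTML = """
<section>
    <h4 class="u-m-t-2">Reference Information</h4>
    <section>
        <p><strong>CVE<!-- -->:</strong><span><a href="/cve/CVE-2004-0804">CVE-2004-0804</a></span></p>
    </section>
    <div>
        <section>
            <p><strong>CERT<!-- -->:</strong><span><a target="_blank" href="https://www.kb.cert.org/vuls/id/555304">555304</a></span></p>
        </section>
    </div>
</section>
"""


def extract_cves(html):
    """Return the text of every link whose href starts with /cve/ (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns an empty list when nothing matches, so no None check is needed.
    return [a.get_text(strip=True) for a in soup.select('a[href^="/cve/"]')]


print(extract_cves(SAMPLE_HTML))             # ['CVE-2004-0804']
print(extract_cves("<p>no references</p>"))  # []
```

An empty list can then be skipped in the loop instead of handling an exception for every page without a CVE.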

Upvotes: 1

from bs4 import BeautifulSoup
import requests


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # url[:23] is "https://www.tenable.com"; prepend it to each root-relative href
    target = [
        url[:23] + x['href'] for x in soup.select('a[href^="/cve/CVE-"]')]
    print(target)


main("https://www.tenable.com/plugins/nessus/19090")

Output:

['https://www.tenable.com/cve/CVE-2004-0804']
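
Slicing the first 23 characters of the URL only works while the host is exactly `https://www.tenable.com`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a root-relative href against whatever page URL it came from. The variable names below are illustrative.

```python
from urllib.parse import urljoin

page_url = "https://www.tenable.com/plugins/nessus/19090"
href = "/cve/CVE-2004-0804"  # root-relative href taken from the question's HTML

# urljoin keeps the scheme and host of page_url and replaces the path with href.
full_url = urljoin(page_url, href)
print(full_url)  # https://www.tenable.com/cve/CVE-2004-0804
```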

Upvotes: 0

Jack Ashton

Reputation: 156

Here is a way to extract the CVE from this page with Python. I'm not sure what the CVE is or what you want from it, but since you know that "CVE" will appear in the href and in the text of the tag, you can search for it with a regex. This is just a starting point; modify it to your liking.

import re

test = """
    <section>
        <h4 class="u-m-t-2">Reference Information</h4>
        <section>
        <p><strong>CVE
            <!-- -->:
            </strong><span><a href="/cve/CVE-2004-0804">CVE-2004-0804</a></span></p>
        </section>
        <section></section>
     <div>
    <section>
        <p><strong>CERT
                <!-- -->:
            </strong><span><a target="_blank" rel="noopener noreferrer" href="https://www.kb.cert.org/vuls/id/555304">555304</a></span></p>
        </section>
    </div>
  </section>
"""

result = re.search(r"<span>(.*CVE.*)</span>", test)
print(result[1])  # group 1: <a href="/cve/CVE-2004-0804">CVE-2004-0804</a>
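
The pattern above captures everything between `<span>` and `</span>` on a line, and because `.*` is greedy it can swallow much more if the real page is served as one long line. A narrower sketch is to match the identifier itself, assuming the IDs follow the standard `CVE-YYYY-NNNN` form (four-digit year, then four or more digits). `find_cve_ids` is a hypothetical helper name.

```python
import re

# Standard CVE identifier: "CVE-", a four-digit year, then four or more digits.
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}")


def find_cve_ids(html):
    """Return unique CVE identifiers in order of first appearance."""
    # dict.fromkeys drops duplicates (the ID appears in both href and link text)
    # while preserving insertion order.
    return list(dict.fromkeys(CVE_ID.findall(html)))


sample = '<span><a href="/cve/CVE-2004-0804">CVE-2004-0804</a></span>'
print(find_cve_ids(sample))  # ['CVE-2004-0804']
```

This returns an empty list for pages without a CVE, so no `None` check on a match object is needed.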

Upvotes: 1
