JJH

Reputation: 9

How to parse HTML elements?

I am looking to extract the items listed under 'Categories' from a list of GitHub webpages.

In the sample code below, I was able to identify the chunk of HTML that I need, but when I call get_text() on it, the output looks like this:

['\n\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n ', '\n\n  Continuous integration\n\n\n  Security\n\n']

The output I am looking for is:

[Continuous integration, Security]

How can I change my get_text() line of code to get to the final result?

from bs4 import BeautifulSoup
import requests

websites = ['https://github.com/marketplace/actions/yq-portable-yaml-processor','https://github.com/marketplace/actions/TruffleHog-OSS']

for links in websites:
    URL = requests.get(links)
    detailsoup = BeautifulSoup(URL.content, "html.parser")

    categories = detailsoup.findAll('div', {'class': 'ml-n1 clearfix'})
    print(categories)
    categoriesList = [categories.get_text() for categories in categories]
    print(categoriesList)

# keep only 1st element & maintain type as list
categoriesList = categoriesList[1:2]
if not categoriesList:
    categoriesList.insert(0, 'Error')

Upvotes: 0

Views: 1726

Answers (1)

HedgeHog

Reputation: 25073

Simply pass strip=True to get_text():

categoriesList = [category.get_text(strip=True) for category in categories]

Also try to select your elements more specifically:

categories = detailsoup.find_all('a', {'class': 'topic-tag'})
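Why the narrower selection matters can be seen in a small offline sketch. The HTML below is hypothetical markup that only imitates the structure described in the question; it is not the real GitHub page. Calling get_text(strip=True) on the whole container strips each text fragment and then joins them without a separator, so the two categories run together, while calling it on each link yields one clean string per category.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure from the question (an assumption,
# not the actual GitHub page source).
html = """
<div class="ml-n1 clearfix">
  <a class="topic-tag" href="#">
    Continuous integration
  </a>
  <a class="topic-tag" href="#">
    Security
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"class": "ml-n1 clearfix"})

# strip=True on the container: fragments are stripped, then joined with "".
stripped = container.get_text(strip=True)

# strip=True on each link: one clean string per category.
per_link = [a.get_text(strip=True) for a in container.find_all("a", {"class": "topic-tag"})]

# stripped_strings yields the same stripped fragments one by one.
fragments = list(container.stripped_strings)

print(repr(stripped))
print(per_link)
print(fragments)
```

If only the container is available, get_text(separator=", ", strip=True) is another option, at the cost of returning a single string rather than a list.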

In new code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
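As a rough comparison of the two recommended styles, here is a minimal sketch run against a hypothetical HTML string (the markup and the other-link class are assumptions for illustration, not taken from GitHub). Both calls return the same elements; select() just expresses the filter as a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names on the live page may differ.
html = """
<div class="ml-n1 clearfix">
  <a class="topic-tag" href="#">Continuous integration</a>
  <a class="topic-tag" href="#">Security</a>
</div>
<a class="other-link" href="#">Not a category</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() with the class_ keyword argument.
by_find_all = [a.get_text(strip=True) for a in soup.find_all("a", class_="topic-tag")]

# select() with a CSS selector: topic-tag links inside a div with class clearfix.
by_select = [a.get_text(strip=True) for a in soup.select("div.clearfix a.topic-tag")]

print(by_find_all)
print(by_select)
```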

Example

from bs4 import BeautifulSoup
import requests

websites = ['https://github.com/marketplace/actions/yq-portable-yaml-processor','https://github.com/marketplace/actions/TruffleHog-OSS']

for links in websites:
    URL = requests.get(links)
    detailsoup = BeautifulSoup(URL.content, "html.parser")

    categories = detailsoup.find_all('a', {'class': 'topic-tag'})
    categoriesList = [category.get_text(strip=True) for category in categories]
    print(categoriesList)

Upvotes: 1
