AlSub

Reputation: 1155

How to extract text from HTML a elements using a custom function?

I am trying to extract the text of the a sub-elements in the first table element of a specific URL.

The main goal is to iterate over all of the a sub-elements and collect their #text values in a list. To accomplish this, I defined a function:

from bs4 import BeautifulSoup
import lxml
import requests

url = 'https://www.salario.com.br/profissao/abacaxicultor-cbo-612510/'

def getColumns(url):
    url = requests.get(url)
    soup=BeautifulSoup(url.text, 'lxml')
    table = soup.find('table', attrs={'class':'listas'})

    # getting salary table (first table on job_title.url)
    # finding table headers

    #first_header = table.find('thead').find('tr').find_all('strong')
    all_other_headers = table.find('thead').find('tr').find('a').find_all('data-tooltip')
    colnames = [hdr.text for hdr in all_other_headers]
   

    return colnames

When I call getColumns(url), it returns an empty list, [], but I expect output that looks like this: ['Sálario Mensal', 'Sálario Anual', 'Salário por Semana', 'Sálario por Hora']

Why does this not work even though I am selecting the a tag? How could I adjust this function?

Upvotes: 0

Views: 35

Answers (1)

Adam Richard

Reputation: 564

To me this looks more complicated than it needs to be. In general, I find it easiest to think of parsing with BeautifulSoup as a progression of searches that move further and further down the document tree.

So for this example I'd think in terms of these steps:

  1. find the table element
  2. get the thead element
  3. find all td elements in the thead
  4. find the a element in each td and read its string value

Here's an example of that approach which gives the output that you're looking for.

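from bs4 import BeautifulSoup
import lxml  # not used directly; the 'lxml' parser below requires it to be installed
import requests

def getColumns(url):
    # Download the page and parse it with the lxml parser
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # The salary table is the first table on the page
    table = soup.find('table')
    thead = table.find('thead')
    # Each header cell is a td inside the thead
    headers = thead.find_all('td')
    # Keep the link text of every cell that contains an a element
    headerNames = [td.find('a').string for td in headers if td.find('a')]

    return headerNames

headerNames = getColumns('https://www.salario.com.br/profissao/abacaxicultor-cbo-612510/')
print(headerNames)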
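
As for why the original function returns an empty list: the first positional argument of find_all() is a tag name, so find_all('data-tooltip') searches for <data-tooltip> tags rather than for a elements that carry a data-tooltip attribute. Here is a minimal offline sketch of the difference. The SAMPLE_HTML string is made-up markup loosely modeled on the question, not the real page, so treat the structure as an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may differ.
SAMPLE_HTML = """
<table class="listas">
  <thead>
    <tr>
      <td><strong>Cargo</strong></td>
      <td><a href="#" data-tooltip="Mensal">Salário Mensal</a></td>
      <td><a href="#" data-tooltip="Anual">Salário Anual</a></td>
    </tr>
  </thead>
</table>
"""

# html.parser ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
row = soup.find("table", attrs={"class": "listas"}).find("thead").find("tr")

# What the question does: take the first <a>, then look inside it for
# tags *named* data-tooltip. No such tags exist, so the result is empty.
by_tag_name = row.find("a").find_all("data-tooltip")

# Filtering on the attribute instead: every <a> in the row that has a
# data-tooltip attribute, whatever its value.
by_attribute = row.find_all("a", attrs={"data-tooltip": True})
names = [a.get_text(strip=True) for a in by_attribute]

print(by_tag_name)
print(names)
```

Note that find() returns only the first match, so chaining .find('a') would have limited the search to a single link even with the right filter. Calling find_all() on the row or on the thead covers every header cell.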

Upvotes: 1
