How to extract text from HTML a elements using a custom function?

Question

I am trying to extract the text of the a sub-elements in the first table element of a specific URL.

The main goal is to iterate over all a sub-elements in order to extract #text as a list. To accomplish this task I defined a function:

from bs4 import BeautifulSoup
import lxml
import requests

url = 'https://www.salario.com.br/profissao/abacaxicultor-cbo-612510/'

def getColumns(url):
    url = requests.get(url)
    soup=BeautifulSoup(url.text, 'lxml')
    table = soup.find('table', attrs={'class':'listas'})

    # getting salary table (first table on job_title.url)
    # finding table headers

    #first_header = table.find('thead').find('tr').find_all('strong')
    all_other_headers = table.find('thead').find('tr').find('a').find_all('data-tooltip')
    colnames = [hdr.text for hdr in all_other_headers]
   

    return colnames

When I call getColumns(url), it returns an empty list, [], but I expect output that looks like this: ['Sálario Mensal', 'Sálario Anual', 'Salário por Semana', 'Sálario por Hora']

Why is this still not working even though I am searching inside the a tag? How could I adjust this function?

Adam Richard · Accepted Answer

To me it looks like this is perhaps being made more complicated than necessary. In general, I find it easiest to think of parsing with BeautifulSoup as a progression of searching further and further down the document tree.

So for this example I'd think in terms of these steps:

1. Find the table element.
2. Get the thead element.
3. Find all td elements in the thead.
4. Find the a element in each td and take its string value.

Here's an example of that approach which gives the output that you're looking for.

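from bs4 import BeautifulSoup
import lxml  # not used directly; the 'lxml' parser below needs this package installed
import requests

def getColumns(url):
    # Download the page and parse it
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # Step down the tree: first table -> its thead -> every td in the thead
    table = soup.find('table')
    thead = table.find('thead')
    headers = thead.find_all('td')
    # Keep the text of the a element in each td, skipping cells without a link
    headerNames = [td.find('a').string for td in headers if td.find('a')]

    return headerNames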

headerNames = getColumns('https://www.salario.com.br/profissao/abacaxicultor-cbo-612510/')
print(headerNames)
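
As for why the original function returned an empty list: find_all('data-tooltip') searches for tags named data-tooltip, not for a tags that carry a data-tooltip attribute, and find('a') only returns the first a element anyway. The sketch below shows the difference offline. The HTML snippet is hypothetical (I am assuming the header links carry a data-tooltip attribute, as the question suggests), and it uses the built-in 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the actual structure may differ.
SAMPLE_HTML = """
<table class="listas">
  <thead>
    <tr>
      <td><strong>Cargo</strong></td>
      <td><a data-tooltip="tip 1">Salário Mensal</a></td>
      <td><a data-tooltip="tip 2">Salário Anual</a></td>
    </tr>
  </thead>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
thead = soup.find("table").find("thead")

# The question's chain: find('a') returns only the first a element, and
# find_all('data-tooltip') then looks for <data-tooltip> tags inside it.
# No such tags exist, so the result is empty.
as_tag_name = thead.find("tr").find("a").find_all("data-tooltip")

# Searching by attribute instead: every a element that has a data-tooltip attribute.
by_attribute = [
    a.get_text(strip=True)
    for a in thead.find_all("a", attrs={"data-tooltip": True})
]

# The same search written as a CSS selector.
by_selector = [
    a.get_text(strip=True)
    for a in soup.select("table thead a[data-tooltip]")
]

print(as_tag_name)   # []
print(by_attribute)  # ['Salário Mensal', 'Salário Anual']
print(by_selector)   # ['Salário Mensal', 'Salário Anual']
```

If the links on the real page turn out not to have that attribute, thead.find_all('a') without the attrs filter should be enough.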
