Reputation: 1155
I am trying to extract the a text sub-elements of the first table element from a specific url. The main goal is to iterate over all the a sub-elements in order to extract their #text as a list.
To accomplish this task I defined a function:
from bs4 import BeautifulSoup
import lxml
import requests

url = 'https://www.salario.com.br/profissao/abacaxicultor-cbo-612510/'

def getColumns(url):
    url = requests.get(url)
    soup = BeautifulSoup(url.text, 'lxml')
    table = soup.find('table', attrs={'class': 'listas'})
    # getting salary table (first table on job_title.url)
    # finding table headers
    #first_header = table.find('thead').find('tr').find_all('strong')
    all_other_headers = table.find('thead').find('tr').find('a').find_all('data-tooltip')
    colnames = [hdr.text for hdr in all_other_headers]
    return colnames
When I apply getColumns(url) it returns an empty list: [], when I expect an output that looks like this: ['Sálario Mensal', 'Sálario Anual', 'Salário por Semana', 'Sálario por Hora']
Why is it still not working if I am using the a tag? How could I adjust this function?
Upvotes: 0
Views: 35
Reputation: 564
To me it looks like this is perhaps being made more complicated than necessary. In general, I find it easiest to think of parsing with BeautifulSoup as a progression of searching further and further down the document tree.
So for this example I'd think in terms of these steps:
1. find the table element
2. find the thead element
3. find the td elements in the thead
4. find the a elements and their string value

Here's an example of that approach which gives the output that you're looking for.
from bs4 import BeautifulSoup
import lxml
import requests

def getColumns(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    table = soup.find('table')
    thead = table.find('thead')
    headers = thead.find_all('td')
    headerNames = [td.find('a').string for td in headers if td.find('a')]
    return headerNames

headerNames = getColumns('https://www.salario.com.br/profissao/abacaxicultor-cbo-612510/')
print(headerNames)
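As a side note on why the original attempt returned an empty list: find_all('data-tooltip') searches for a tag named data-tooltip, but data-tooltip is an attribute on the a tags. If you'd rather keep filtering on that attribute, a sketch like the one below should also work. This is only an illustration, assuming the header links really do carry a data-tooltip attribute on that page, as your original code suggests, and it reuses the listas class from your own find call.

from bs4 import BeautifulSoup
import requests

def getColumnsByTooltip(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # 'a[data-tooltip]' matches <a> tags that have the data-tooltip attribute,
    # whatever its value; 'table.listas thead' restricts the search to the
    # salary table's header row
    links = soup.select('table.listas thead a[data-tooltip]')
    return [a.get_text(strip=True) for a in links]

print(getColumnsByTooltip('https://www.salario.com.br/profissao/abacaxicultor-cbo-612510/'))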
Upvotes: 1