hoge6b01

Reputation: 127

Combine multiple BeautifulSoup calls

I want to iterate over a webpage and use BeautifulSoup to find/select tags in the HTML. For now, I have two separate statements, but I'd like to do it in one statement so I don't have to iterate over the same page twice. My code is the following:

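import json

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

sapo = "https://casa.sapo.pt/comprar-apartamentos/ofertas-recentes/distrito.lisboa/?pn=1"
response = requests.get(sapo, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# One parsed JSON-LD object per <script type="application/ld+json"> tag
data1 = [json.loads(x.string) for x in soup.find_all("script", type="application/ld+json")]
# The div.property containers, one per listing
data2 = soup.select('div.property')
# Drop the first two JSON-LD entries, which are not listings
del data1[:2]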

There are 25 properties on the page. data1 returns 27 results, but the first 2 are just overhead, so I delete them. That leaves 25 results with 10 "columns". Now I'd like to add data2 as an 11th column.

How could I achieve this?

Upvotes: 1

Views: 80

Answers (1)

HedgeHog

Reputation: 25073

I am not sure why you would want the whole HTML element, but here we go. Change your strategy for selecting elements and start with the containers:

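import pandas as pd

data = []

# Each div.property container holds both the markup and its own JSON-LD script
for e in soup.select('div.property'):
    d = {'html': e}
    d.update(json.loads(e.script.string))
    data.append(d)

pd.DataFrame(data)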
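
To try the container-first loop without requesting the site, here is a minimal sketch on an inline HTML snippet. The markup, names and prices below are made up for illustration and only assume what the loop above already assumes: every div.property holds its own JSON-LD script. The check for a missing script is an extra precaution, not something the real page is known to need.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two listings, each with its own JSON-LD script
html = """
<div class="property">
  <a href="/detalhes/t2-lisboa">T2 Lisboa</a>
  <script type="application/ld+json">{"name": "T2 Lisboa", "price": 250000}</script>
</div>
<div class="property">
  <a href="/detalhes/t3-sintra">T3 Sintra</a>
  <script type="application/ld+json">{"name": "T3 Sintra", "price": 310000}</script>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for container in soup.select("div.property"):
    script = container.find("script", type="application/ld+json")
    if script is None or not script.string:
        continue  # skip containers that carry no embedded JSON-LD
    row = {"html": container}              # the extra "column" with the element itself
    row.update(json.loads(script.string))  # the JSON-LD fields for the same listing
    rows.append(row)
```

Because each row is built from a single container, the element and its JSON can never get out of step, which is the main advantage over building two separate lists and matching them by position.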

EDIT

Based on your comment, extract the href from the first link in each container via:

d = {'link':'https://casa.sapo.pt'+e.a.get('href')}

data = []

for e in soup.select('div.property'):
    d = {'link':'https://casa.sapo.pt'+e.a.get('href')}
    d.update(json.loads(e.script.string))
    data.append(d)

pd.DataFrame(data)
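
If some containers might lack a link, or if the href values are not always site-relative (both are assumptions, I have not checked every listing), plain string concatenation could fail or produce a broken URL. A small helper built on urllib.parse.urljoin is one way to guard against that; absolute_link and BASE_URL are names made up for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://casa.sapo.pt"  # site root used in the concatenation above


def absolute_link(href):
    """Return an absolute URL for href, or None when there is no href."""
    if not href:
        return None
    # urljoin leaves already-absolute URLs untouched and resolves relative ones
    return urljoin(BASE_URL, href)
```

Inside the loop this would be used as d = {'link': absolute_link(e.a.get('href') if e.a else None)}.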

Upvotes: 1
