Aedam

Reputation: 151

BS4 Scraper is producing html of the entire div code, not just the href link

A screenshot of the website's HTML is here: https://i.sstatic.net/FEIAa.png

The code I am using:

import requests
import time
from bs4 import BeautifulSoup
import sys

sys.stdout = open("links.txt", "a")

for x in range(0, 2):
    try:
        URL = f'https://link.com/{x}'
        page = requests.get(URL)

        soup = BeautifulSoup(page.content, 'html.parser')

        rows = soup.find_all('div', id='view')
        for row in rows:
            print(row.text)
        time.sleep(5)
    except:
        continue

I just want the output to be the list of links shown in the highlighted code. Instead, I get the HTML of the entire view div, not just the href values.

Example of output that it produces:

<div id="view">
<a href="/watch/8f310ba6dfsdfsdfsdf" target="_blank"><img src="/thumbs/jpg/8f310ba6dfsdfsdfsdf.jpg" width="300"/></a>
...
...

When what I want it to produce is:

/watch/8f310ba6dfsdfsdfsdf
...
...

Upvotes: 1

Views: 53

Answers (3)

KunduK

Reputation: 33384

Use the following code, which finds all anchor tags under the div tag and then gets each href value.

soup = BeautifulSoup(page.content, 'html.parser')
for links in soup.find('div',id='view').find_all('a'):
    print(links['href'])

If you have BeautifulSoup 4.7.1 or above, you can use the following CSS selector instead.

soup = BeautifulSoup(page.content, 'html.parser')
for links in soup.select('#view>a'):
    print(links['href'])
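
One caveat on the selector above: `#view>a` matches only anchors that are direct children of the div, while `#view a` matches anchors at any depth. A minimal sketch of the difference, using made-up HTML (the nested `<p>` wrapper and the `aaa`/`bbb` paths are assumptions for illustration, not taken from the asker's page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one anchor directly under the div, one nested in a <p>.
html = """
<div id="view">
<a href="/watch/aaa"><img src="/thumbs/jpg/aaa.jpg"/></a>
<p><a href="/watch/bbb">nested</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Child combinator: only anchors whose parent is the div itself.
direct = [a["href"] for a in soup.select("#view > a")]

# Descendant combinator: anchors anywhere inside the div.
anywhere = [a["href"] for a in soup.select("#view a")]

print(direct)    # ['/watch/aaa']
print(anywhere)  # ['/watch/aaa', '/watch/bbb']
```

If the real page wraps its links in extra elements, the descendant form is probably the safer choice.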

Upvotes: 2

Rajarishi Devarajan

Reputation: 581

You can get your desired result by extracting the href attribute of each a tag inside the div:

rows = soup.find_all('div', id='view')
for row in rows:
    links = row.find_all('a')
    for link in links:
        print(link['href'])
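
To check this without hitting the network, the same loop can be run against the sample output from the question. This is only a sketch: the helper name `extract_hrefs` is hypothetical, and the `href=True` filter is an addition that skips anchors with no href attribute.

```python
from bs4 import BeautifulSoup


def extract_hrefs(html):
    """Return the href of every <a> inside each <div id="view">."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for row in soup.find_all("div", id="view"):
        # href=True keeps only anchors that actually carry an href attribute.
        for link in row.find_all("a", href=True):
            hrefs.append(link["href"])
    return hrefs


# Sample markup copied from the question, with the div closed by hand.
sample = """<div id="view">
<a href="/watch/8f310ba6dfsdfsdfsdf" target="_blank"><img src="/thumbs/jpg/8f310ba6dfsdfsdfsdf.jpg" width="300"/></a>
</div>"""

print(extract_hrefs(sample))  # ['/watch/8f310ba6dfsdfsdfsdf']
```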

Upvotes: 0

Ahmed Soliman

Reputation: 1710

You are retrieving the whole content of the div tag. To get only the links within the div, add the a tag to the CSS selector as follows:

links = soup.select('div[id="view"] a')
for link in links:
    print(link.get('href'))
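
Note that `link.get('href')` returns None when an anchor has no href, whereas `link['href']` raises a KeyError. A small sketch of handling that, assuming made-up HTML in which the second anchor deliberately lacks an href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute.
html = '<div id="view"><a href="/watch/abc">ok</a><a name="top">no href</a></div>'
soup = BeautifulSoup(html, "html.parser")

# .get() yields None for the anchor without an href instead of raising.
raw = [link.get("href") for link in soup.select('div[id="view"] a')]
print(raw)  # ['/watch/abc', None]

# Drop the None entries before writing the links out.
hrefs = [h for h in raw if h is not None]
print(hrefs)  # ['/watch/abc']
```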

Upvotes: 0
