Web scraping an HTML tag in Python

Question

I would like to scrape all the links that end with .php. I have written a regex to select the target URLs, such as samsung-phones-f-9-0-r1-p1.php.

I am wondering whether something is wrong with my regex or whether the tag is incorrect.

Thank you so much in advance for answering.

from bs4 import BeautifulSoup
from urllib import request
import re

base_url = 'https://www.gsmarena.com/samsung-phones-9.php'
# build the request with a browser-like User-Agent header
webrequest = request.Request(base_url, headers = {
    "User-Agent" : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36'})

# open the url, sending the headers defined above
html = request.urlopen(webrequest).read().decode('utf-8')
soup = BeautifulSoup(html, features = 'lxml')
# scraping sub urls
sub_urls = soup.find_all('a', {"href": re.compile("(samsung).+(.php)")})
# https:\/\/www\.gsmarena\.com\/samsung.+(.php)
print(sub_urls)

idar · Accepted Answer

Your approach is correct, but find_all returns tag objects, and you are not extracting the actual href attribute from them.
Change this line:

sub_urls = soup.find_all('a', {"href": re.compile("(samsung).+(.php)")})

to this:

sub_urls = [x.get('href') for x in soup.find_all('a', {"href": re.compile("(samsung).+(.php)")})]
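On the regex itself: the dot in (.php) is unescaped, so it matches any character followed by "php", and nothing ties the match to the end of the link. A minimal sketch of a tighter pattern follows, applied to href strings like those the list comprehension above returns. The sample hrefs and the names loose, strict, and sample_hrefs are made up for illustration, and the tighter pattern is an assumption about the intent ("links that end with .php"), not a tested fix for the live page.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.gsmarena.com/samsung-phones-9.php"

# Pattern from the question: "." is unescaped and the match is not anchored.
loose = re.compile(r"(samsung).+(.php)")

# Tighter variant (assumed intent): a literal ".php" at the very end.
strict = re.compile(r"samsung.+\.php$")

# Hypothetical href values, standing in for the scraped ones.
sample_hrefs = [
    "samsung-phones-f-9-0-r1-p1.php",
    "samsung_galaxy_a12-10604.php",
    "samsung-phones-9.php?sort=name",
    "samsung-xphp-notes.html",
    "apple-phones-48.php",
]

# re.search looks for the pattern anywhere in the string.
loose_hits = [h for h in sample_hrefs if loose.search(h)]
strict_hits = [h for h in sample_hrefs if strict.search(h)]

# The hrefs are relative, so resolve them against the page URL.
absolute_urls = [urljoin(BASE_URL, h) for h in strict_hits]

print(loose_hits)
print(strict_hits)
print(absolute_urls)
```

The loose pattern also accepts the .html and query-string samples, while the strict one keeps only the two links ending in .php. The compiled strict pattern can be passed as the href filter to find_all in place of the original one.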
