Maymay

Reputation: 13

Extracting links from HTML in Python

I need to write a program that takes a user-supplied web address, parses the HTML to find links, and then stores all the links in another HTML file in a certain format. I only have access to built-in Python modules (Python 3). I'm able to get the HTML from the address using urllib.request and put it into a string. How would I go about extracting the links from this string and putting them into a list of strings? Also, would it be possible to identify link types (such as an image link or an MP3 link) so I can put them into different lists? Then I could categorize them when creating the output file.

Upvotes: 1

Views: 744

Answers (2)

Odai Al-Ghamdi

Reputation: 302

Try the html.parser module or the re module from the standard library; either can help with this. If you go the regex route, a pattern like this should match most absolute URLs:

r'https?://[^\s<>"]+|www\.[^\s<>"]+'
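For the html.parser route, here is a minimal sketch using only the standard library. The names LinkCollector and extract_links are made up for illustration, and it assumes that links live in href and src attributes, which covers anchors, images, scripts, and audio sources but not every possible case.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href/src attribute values from a page."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page address, if one was given
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url=""):
    """Return every href/src value found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_links(html, address) with the string you got from urllib.request should give you a plain list of link strings. Passing the page address as base_url is optional; without it, relative links are returned as written.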

Upvotes: 1

korylprince

Reputation: 3009

You can use the re module to search the HTML text for links. In particular, the findall function returns every match.

As for sorting by file type, that depends on whether the URL actually contains the extension (e.g. .mp3, .js, .jpeg).

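You could do a simple for loop like this:

import re

# getHTMLText() stands in for however you fetched the page (e.g. with urllib.request)
html = getHTMLText()
mp3s = []
other = []
# '<reexpression>' is a placeholder for your link-matching pattern
for match in re.findall('<reexpression>', html):
    if match.endswith('.mp3'):
        mp3s.append(match)
    else:
        other.append(match)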
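To make the loop above concrete, here is a rough sketch with one possible pattern filled in. The pattern, the extension sets, and the name categorize_links are all assumptions for illustration; the regex only catches quoted href/src values and will miss unquoted attributes. It also strips the query string before checking the extension, so a link like song.mp3?dl=1 still counts as audio.

```python
import os
import re
from urllib.parse import urlparse

# Assumed pattern: captures the quoted value of href="..." or src='...'
LINK_RE = re.compile(r'''(?:href|src)\s*=\s*["']([^"']+)["']''', re.IGNORECASE)

# Assumed extension groups; extend these to suit your output format
AUDIO = {".mp3", ".wav", ".ogg"}
IMAGES = {".jpg", ".jpeg", ".png", ".gif"}


def categorize_links(html):
    """Group links found in html by the file extension of their path."""
    groups = {"audio": [], "images": [], "other": []}
    for link in LINK_RE.findall(html):
        # Look only at the path so query strings and fragments are ignored
        ext = os.path.splitext(urlparse(link).path)[1].lower()
        if ext in AUDIO:
            groups["audio"].append(link)
        elif ext in IMAGES:
            groups["images"].append(link)
        else:
            groups["other"].append(link)
    return groups
```

Links without a recognizable extension end up in "other", which matches the caveat above: you can only sort by type when the URL actually carries the extension.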

Upvotes: 1
