Reputation: 13
I have to write a program that takes a user-input web address, parses the HTML to find links, and then stores all the links in another HTML file in a certain format. I only have access to built-in Python modules (Python 3). I'm able to get the HTML code from the link using urllib.request and put it into a string. How would I actually go about extracting the links from this string and putting them into a list? Also, would it be possible to identify link types (such as an image link or an mp3 link) so I can put them into different lists? (Then I could categorize them when creating the output file.)
Upvotes: 1
Views: 744
Reputation: 302
Try the html.parser module (its HTMLParser class) or the re module; either will help you do that. If you go the regex route, something like this will match most URLs:
r'http[s]?://[^\s<>"]+|www\.[^\s<>"]+'
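A minimal sketch of the html.parser route, in case you prefer it over regex. HTMLParser and handle_starttag are standard library; the LinkCollector class name and the links attribute are just illustrative:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag into self.links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<a href="http://example.com">home</a> <a href="song.mp3">song</a>')
print(parser.links)  # ['http://example.com', 'song.mp3']
```

You would call parser.feed() with the HTML string you already fetched via urllib.request.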
Upvotes: 1
Reputation: 3009
You can use the re module to parse the HTML text for links. In particular, the findall function returns every match.
As far as sorting by file type goes, that depends on whether the URL actually contains the extension (e.g. .mp3, .js, .jpeg, etc.).
You could do a simple for loop like this:
import re
html = getHTMLText()  # your existing function that returns the page source as a string
mp3s = []
other = []
for match in re.findall(r'<reexpression>', html):  # substitute your URL regex
    if match.endswith('.mp3'):
        mp3s.append(match)
    else:
        other.append(match)
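One caveat with endswith: a URL may carry a query string (e.g. song.mp3?id=7), so it can be safer to check the extension of the URL's path component. A sketch using only the standard library; the categorize function and its bucket names are illustrative, not from the question:

```python
import os
from urllib.parse import urlparse

def categorize(urls):
    """Group URLs by the file extension of their path component."""
    buckets = {'mp3': [], 'image': [], 'other': []}
    image_exts = {'.jpg', '.jpeg', '.png', '.gif'}
    for url in urls:
        # urlparse().path drops any ?query or #fragment before we split the extension
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        if ext == '.mp3':
            buckets['mp3'].append(url)
        elif ext in image_exts:
            buckets['image'].append(url)
        else:
            buckets['other'].append(url)
    return buckets

result = categorize(['http://x.com/song.mp3?id=7',
                     'http://x.com/pic.JPG',
                     'http://x.com/page.html'])
# result['mp3'] contains the first URL despite its query string
```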
Upvotes: 1