user16312732
user16312732

Reputation:

How to extract some url from html?

I need to extract all image links from a local html file. Unfortunately, I can't install bs4 and cssutils to process html.

html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""

I tried to extract data using a regex:

images = []
for line in html.split('\n'):
    images.append(re.findall(r'(https://s2.*\?lastmod=\d+)', line))
print(images)

[['https://s2.example.com/path/image0.jpg?lastmod=1625296911'],
 ['https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912']]

I suppose my regular expression is greedy because I used .*? How to get the following outcome?

images = ['https://s2.example.com/path/image0.jpg',
          'https://s2.example.com/path/image1.jpg',
          'https://s2.example.com/path/image2.jpg',
          'https://s2.example.com/path/image3.jpg']

If it can help all links are enclosed by src="..." or url(...)

Thanks for your help.

Upvotes: 3

Views: 122

Answers (3)

import re
xx = '<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911" alt="asdasd"><img a src="https://s2.example.com/path/image0.jpg?lastmod=1625296911">'
r1 = re.findall(r"<img(?=\s|>)[^>]*>",xx)
url = []
for x in r1:
  x = re.findall(r"src\s{0,}=\s{0,}['\"][\w\d:/.=]{0,}",x)
  if(len(x)== 0): continue
  x = re.findall(r"http[s]{0,1}[\w\d:/.=]{0,}",x[0])
  if(len(x)== 0): continue
  url.append(x[0])
print(url)

Upvotes: 0

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627468

You can use

import re
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
images = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', html)
print(images)

See the Python demo. Output:

['https://s2.example.com/path/image0.jpg',
 'https://s2.example.com/path/image1.jpg',
 'https://s2.example.com/path/image2.jpg', 
 'https://s2.example.com/path/image3.jpg']

See the regex demo, too. It means

  • https://s2 - some literal text
  • [^\s?]* -zero or more chars other than whitespace and ? chars
  • (?=\?lastmod=\d) - immediately to the right, there must be ?lastmode= and a digit (the text is not added to the match since it is a pattern inside a positive lookahead, a non-consuming pattern).

Upvotes: 0

hasleron
hasleron

Reputation: 501

import re
indeces_start = sorted(
    [m.start()+5 for m in re.finditer("src=", html)]
    + [m.start()+4 for m in re.finditer("url", html)])
indeces_end = [m.end() for m in re.finditer(".jpg", html)]

image_list = []

for start,end in zip(indeces_start,indeces_end):
  image_list.append(html[start:end])

print(image_list)

That's a solution which comes to my mind. It consists of finding the start and end indeces of the image path strings. It obviously has to be adjusted if there are different image types.

Edit: Changed the start criteria, in case there are other URLs in the document

Upvotes: 1

Related Questions