Dany M

Reputation: 870

Python: how to get all images that are not part of the page template

I'm looking for a way to extract all the main images of a web page. The easy way is to do it with lxml:

import lxml.html
import requests

# download the page source
html = requests.get('https://fr.wikipedia.org/wiki/Image').text

tree = lxml.html.fromstring(html)
# select every <img> element that has a src attribute
img = tree.xpath('//img[@src]')

This way we get all images, including logos, icons, pictograms, CSS sprites, etc. What I would like to get is only the real images that are in the content. Any ideas? Thanks.

Upvotes: 0

Views: 26

Answers (1)

Siebe Jongebloed

Reputation: 4870

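On Wikipedia (MediaWiki) pages the article body sits inside the div with id "mw-content-text", so restrict the XPath to that container: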

//div[@id="mw-content-text"]//img[@src]
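Here is a minimal sketch of how that XPath could be applied, assuming the page wraps its article body in a div with id "mw-content-text". The helper name content_image_urls, its container_id parameter, and the inline SAMPLE_HTML are made up for illustration and do not come from the question. The sketch parses an inline snippet instead of fetching the page, and resolves each src against the page URL because Wikipedia often uses protocol-relative image URLs.

```python
from urllib.parse import urljoin

import lxml.html

# Hypothetical sample page: one template image (logo) and two content images.
SAMPLE_HTML = """
<html><body>
  <div id="mw-head"><img src="/static/logo.png"></div>
  <div id="mw-content-text">
    <p><img src="//upload.wikimedia.org/photo.jpg"></p>
    <img src="/images/diagram.png">
  </div>
</body></html>
"""


def content_image_urls(html, base_url, container_id="mw-content-text"):
    """Return absolute URLs of <img> elements inside the given container div.

    container_id is an assumption about the page layout; it matches
    MediaWiki pages but will differ on other sites.
    """
    tree = lxml.html.fromstring(html)
    # $cid is an XPath variable, bound through the keyword argument below
    srcs = tree.xpath('//div[@id=$cid]//img/@src', cid=container_id)
    # urljoin turns relative and protocol-relative paths into full URLs
    return [urljoin(base_url, src) for src in srcs]


urls = content_image_urls(SAMPLE_HTML, "https://fr.wikipedia.org/wiki/Image")
print(urls)
```

For a site that is not built on MediaWiki, the container id will be different, so you would have to inspect the page and pass the right value as container_id.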

Upvotes: 1
