Reputation: 125
Currently I need a program that, given a URL, returns a list of all the images on the webpage,
e.g.:
logo.png gallery1.jpg test.gif
Is there any open source software available before I try to code something myself?
The language should be Java. Thanks, Philip
Upvotes: 9
Views: 18732
Reputation: 1
You can simply use a regular expression in Java. Given this HTML content:
<html>
<body>
<p>
<img src="38220.png" alt="test" title="test" />
<img src="32222.png" alt="test" title="test" />
</p>
</body>
</html>
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "html"; // the HTML content shown above
Pattern p = Pattern.compile("<img [^>]*src=[\"']([^\"']*)");
Matcher m = p.matcher(s);
while (m.find()) {
    String src = m.group(1); // capture group 1 holds just the src value
    System.out.println(src);
}
Upvotes: 0
Reputation: 532
With Open Graph tags and HTML Parser, you can extract your data really easily (PageMeta is a simple POJO holding the results):
Parser parser = new Parser(url);
PageMeta pageMeta = new PageMeta();
pageMeta.setUrl(url);
NodeList meta = parser.parse(new TagNameFilter("meta"));
for (SimpleNodeIterator iterator = meta.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    if ("og:image".equals(tag.getAttribute("property"))) {
        pageMeta.setImageUrl(tag.getAttribute("content"));
    }
    if ("og:title".equals(tag.getAttribute("property"))) {
        pageMeta.setTitle(tag.getAttribute("content"));
    }
    if ("og:description".equals(tag.getAttribute("property"))) {
        pageMeta.setDescription(tag.getAttribute("content"));
    }
}
Upvotes: 0
Reputation: 570365
This is dead simple with HTML Parser (and any other decent HTML parser):
Parser parser = new Parser("http://www.yahoo.com/");
NodeList list = parser.parse(new TagNameFilter("IMG"));
for (SimpleNodeIterator iterator = list.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    System.out.println(tag.getAttribute("src"));
}
Upvotes: 4
Reputation: 1108722
Just use a simple HTML parser, like jTidy, then get all elements by tag name img and collect the src attribute of each in a List<String> (or maybe List<URI>).
You can obtain an InputStream of a URL using URL#openStream() and then feed it to any HTML parser you like to use. Here's a kickoff example:
InputStream input = new URL("http://www.stackoverflow.com").openStream();
Document document = new Tidy().parseDOM(input, null);
NodeList imgs = document.getElementsByTagName("img");
List<String> srcs = new ArrayList<String>();
for (int i = 0; i < imgs.getLength(); i++) {
    srcs.add(imgs.item(i).getAttributes().getNamedItem("src").getNodeValue());
}
for (String src : srcs) {
    System.out.println(src);
}
I must however admit that HtmlUnit as suggested by Bozho indeed looks better.
Upvotes: 14
Reputation: 15623
You can parse the HTML and collect all src attributes of img elements in a Collection. Then download each resource from its URL and write it to a file. For parsing, several HTML parsers are available; Cobra is one of them.
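As a sketch of the collect-all-src-attributes step without any third-party dependency, the JDK's built-in javax.swing.text.html parser can do it (class and method names below are just illustrative, not from any of the libraries mentioned here):

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ImgSrcCollector {
    // Parses HTML from the reader and returns the src attribute of every img tag.
    public static List<String> collectImgSrcs(Reader html) throws Exception {
        List<String> srcs = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // img is an empty element, so the parser reports it as a simple tag
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        srcs.add(src.toString());
                    }
                }
            }
        };
        new ParserDelegator().parse(html, callback, true);
        return srcs;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><img src=\"logo.png\"/><img src=\"gallery1.jpg\"/></body></html>";
        System.out.println(collectImgSrcs(new StringReader(html))); // prints [logo.png, gallery1.jpg]
    }
}
```

To run it against a live page, pass new InputStreamReader(new URL(url).openStream()) instead of the StringReader.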
Upvotes: 0
Reputation: 597106
HtmlUnit has HtmlPage.getElementsByTagName("img"), which will probably suit you.
(Read the short Getting started guide to see how to obtain the correct HtmlPage object.)
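A rough sketch of that approach, assuming HtmlUnit is on the classpath (the URL is a placeholder, and WebClient is AutoCloseable only in recent HtmlUnit versions):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ImageLister {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // getPage fetches and parses the document into an HtmlPage
            HtmlPage page = webClient.getPage("http://www.example.com/");
            for (DomElement img : page.getElementsByTagName("img")) {
                System.out.println(img.getAttribute("src"));
            }
        }
    }
}
```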
Upvotes: 12