Reputation: 9198
Given a string containing HTML (possibly malformed), how can I find the title? This seems like it should be simple, yet I'm struggling to do it.
UPDATE: As requested, here are some URLs for which Jsoup can't seem to find the title in the HTML. I collected their HTML about a month ago, so some pages may have changed since.
http://www.miamitodaynews.com/news/050113/crossword.shtml ()
http://www.miamitodaynews.com/news/081218/cal-highlights.shtml/feed/ ()
http://www.miashoes.com/mia-limited-edition/flats.html?refineclr=2125%2C2136 ()
http://www.mica.edu/News/Workshop_on_111809_Archive_and_Inventory_Your_Image_Collections.html ()
http://www.michaelgeist.ca/2011/10/daily-digital-lock-15/ ()
http://www.michaelkors.com/bags/_/N-283g?cmCat=cat000000cat144cat44301cat44302&index=9&isEditorial=false ()
http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat35701cat30001&index=39&isEditorial=false ()
http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat7502&index=92&isEditorial=false ()
http://www.michaelmillerfabrics.com/catalog/seo_sitemap/product/?p=2 ()
http://www.michaels.com/10104250.html ()
http://www.menseffects.com/PROMETHEUS-2-Switchblade-Automatic-Knife-p/att00176a.htm (http://www.menseffects.com/PROMETHEUS-2-Switchblade-Automatic-Knife-p/att00176a.htm)
http://www.menstennisforums.com/misc.php?do=whoposted&t=16764 (http://www.menstennisforums.com/misc.php?do=whoposted&t=16764)
http://www.menstennisforums.com/showpost.php?p=12242018&postcount=115 (http://www.menstennisforums.com/showpost.php?p=12242018&postcount=115)
http://www.menstennisforums.com/showpost.php?p=12623891&postcount=13 (http://www.menstennisforums.com/showpost.php?p=12623891&postcount=13)
http://www.menstennisforums.com/showpost.php?p=13010289&postcount=5476 (http://www.menstennisforums.com/showpost.php?p=13010289&postcount=5476)
http://www.menstylepower.com/category/blog/page/14/ ()
http://www.menstylepower.com/tag/mens-loafers/ ()
http://www.memorysuppliers.com/product-tag/usb-drive/?filter_color=46%2C45&filter_double-sided-imprint=295 ()
http://www.memorysuppliers.com/usb-flash-drives/?filter_imprint-area=306&filter_material=291&filter_price=305 ()
http://www.memorysuppliers.com/usb-flash-drives/best-sellers/?filter_color=51%2C27&filter_material=290&filter_price=302 ()
http://www.memorysuppliers.com/usb-flash-drives/best-sellers/?filter_color=51&filter_imprint-area=306&filter_speed=296 ()
http://www.memorysuppliers.com/usb-flash-drives/capless/?filter_color=51%2C47&filter_double-sided-imprint=294&filter_speed=296 ()
http://www.memphisdailynews.com/Search/Search.aspx?fn=Cathy&ln=Rogers&redir=1 ()
http://www.memphisdailynews.com/Search/Search.aspx?redir=1&sno=931%20Frayser%20Blvd ()
http://www.memphisdailynews.com/Search/Search.aspx?redir=1&sno=314%2BS.%2BMain%2BSt ()
http://www.memphisdailynews.com/news/2012/dec/27/starbucks-cups-to-come-with-a-political-message/ ()
http://www.memphisdailynews.com/news/2014/mar/24/tigers-season-ends-on-common-theme-underachieved/ ()
http://www.memphismagazine.com/December-2006/Blade-Runner/ ()
Upvotes: 1
Views: 3810
Reputation: 1273
This is trivially easy with the excellent jsoup library: parse the string into a Document and call title() on it.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SoGetTitleFromString {
    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        Document doc = Jsoup.parse(html);
        String title = doc.title();
        System.out.println("Title is: " + title);
    }
}
Output:
Title is: First parse
Edit: OK, so what you are trying to do is get a list of titles from a string of URLs. The String that you are parsing is a list of URLs, not the HTML itself. Try this:
import java.io.IOException;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SoGetTitlesFromListOfUrls {
    public static void main(String[] args) throws IOException {
        String inUrls = "http://www.miamitodaynews.com/news/050113/crossword.shtml ()\n"
                + "http://www.miamitodaynews.com/news/081218/cal-highlights.shtml/feed/ ()\n"
                + "http://www.miashoes.com/mia-limited-edition/flats.html?refineclr=2125%2C2136 ()\n"
                + "http://www.mica.edu/News/Workshop_on_111809_Archive_and_Inventory_Your_Image_Collections.html ()\n"
                + "http://www.michaelgeist.ca/2011/10/daily-digital-lock-15/ ()\n"
                + "http://www.michaelkors.com/bags/_/N-283g?cmCat=cat000000cat144cat44301cat44302&index=9&isEditorial=false ()\n"
                + "http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat35701cat30001&index=39&isEditorial=false ()\n"
                + "http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat7502&index=92&isEditorial=false ()\n"
                + "http://www.michaelmillerfabrics.com/catalog/seo_sitemap/product/?p=2 ()\n"
                + "http://www.michaels.com/10104250.html ()\n";
        // try-with-resources closes the Scanner when the loop finishes
        try (Scanner urlScanner = new Scanner(inUrls)) {
            while (urlScanner.hasNextLine()) {
                // Get the first token from the line, space delimited
                String url = urlScanner.nextLine().split(" ")[0];
                // Fetch the page over the network and parse it
                Document doc = Jsoup.connect(url).get();
                String title = doc.title();
                System.out.println("Title is: " + title);
            }
        }
    }
}
Output:
Title is: Miami Today Crossword Answers - Miami Today
Title is: Comments on: Calendar Of Events Highlights
Title is: MIA LIMITED EDITION FLATS - WOMEN FLATS
Title is: Workshop on 11.18.09: Archive & Inventory Your Image Collections | MICA
Title is: The Daily Digital Lock Dissenter, Day 15: Canadian Bookseller Association - Michael Geist
Title is: Handbags - Crossbody to Clutches to Totes & More | Michael Kors
Title is: Watches by Michael Kors - Womens & Mens Luxury, Chic & Timeless Styles
Title is: Watches by Michael Kors - Womens & Mens Luxury, Chic & Timeless Styles
Title is: Site Map
Title is: Creatology™ 3D Foam Kit, Pirate Ship
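The line-splitting step in the loop above can also be pulled out into its own helper, which makes it easier to check before any network calls are made. This is only a sketch using the standard library: the class name UrlListParser and the method firstTokens are made up for illustration, and it assumes each non-blank line starts with the URL followed by whitespace. Unlike split(" ")[0] on the raw line, it skips blank lines, so an empty string never reaches Jsoup.connect.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup or the original answer.
public class UrlListParser {
    /** Returns the first whitespace-delimited token of every non-blank line. */
    public static List<String> firstTokens(String text) {
        List<String> urls = new ArrayList<>();
        for (String line : text.split("\\R")) { // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            urls.add(trimmed.split("\\s+")[0]);
        }
        return urls;
    }

    public static void main(String[] args) {
        String inUrls = "http://www.michaels.com/10104250.html ()\n"
                + "\n"
                + "http://www.michaelgeist.ca/2011/10/daily-digital-lock-15/ ()\n";
        for (String url : firstTokens(inUrls)) {
            System.out.println(url);
        }
    }
}
```

Each returned string can then be passed to Jsoup.connect(url).get() as in the loop above.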
Upvotes: 1
Reputation: 5569
The simplest way is to use a regular expression. This is adapted from an example on java2s.com.
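import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Scratch {
    public static void main(String[] argv) throws Exception {
        URL url = new URL("http://www.java.com/");
        URLConnection urlConnection = url.openConnection();
        StringBuilder html = new StringBuilder();
        // DataInputStream.readUTF() expects the length-prefixed format written by
        // DataOutputStream.writeUTF(), not plain text, so read the response line
        // by line instead (assuming the page is UTF-8 encoded).
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(urlConnection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(' ').append(line);
            }
        }
        // Collapse runs of whitespace so a title split across lines still matches
        String normalized = html.toString().replaceAll("\\s+", " ");
        // Case-insensitive, and tolerant of attributes on the opening tag
        Pattern p = Pattern.compile("<title\\b[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(normalized);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}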
Upvotes: 1
Reputation: 641
Use an HTML parser for Java such as HTMLParser, or use a regular expression to pull the title out of the malformed HTML string, maybe something like <title>(.*?)</title>
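A minimal sketch of the regular-expression route, working directly on a string rather than a URL. The class name TitleExtractor and the method extractTitle are invented for this example. It assumes the title is wrapped in a title element with a closing tag, and it will be fooled by things a real parser handles, such as a title tag inside a comment or script, so treat it as a fallback rather than a replacement for a parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not taken from any library.
public class TitleExtractor {
    // CASE_INSENSITIVE accepts <TITLE>; DOTALL lets the title span line breaks;
    // [^>]* tolerates attributes on the opening tag.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the first title found, with whitespace collapsed, or empty if none. */
    public static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).replaceAll("\\s+", " ").trim());
    }

    public static void main(String[] args) {
        // Malformed on purpose: no </head>, unclosed <p>, no closing </html>
        String html = "<HTML><head><TITLE>\n  First   parse\n</TITLE><body><p>Unclosed paragraph";
        System.out.println(extractTitle(html).orElse("(no title)"));
    }
}
```

Returning an Optional makes the "no title found" case explicit, which helps when working through a long list of pages where some have no title element at all.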
Upvotes: 0