Reputation: 24154
I am crawling a web page, extracting all of its links, and then parsing each URL with Apache Tika and BoilerPipe using the code below. Most URLs parse fine, but for a few XML documents I get the following error. I am not sure what this error means. Is it a problem with my code or with the XML file? This is line 100 in HTMLParser.java:
String parsedText = tika.parseToString(htmlStream, md);
The error I am getting:
org.apache.tika.exception.TikaException: Invalid XML: Error on line 16: Invalid byte 1 of 1-byte UTF-8 sequence.
    at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:75)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.Tika.parseToString(Tika.java:357)
    at edu.uci.ics.crawler4j.crawler.HTMLParser.parse(HTMLParser.java:101)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.handleHtml(WebCrawler.java:227)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:299)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:118)
    at java.lang.Thread.run(Unknown Source)
The HTMLParser.java code:
public void parse(String htmlContent, String contextURL) {
    InputStream htmlStream = null;
    text = null;
    title = null;
    metaData = new HashMap<String, String>();
    urls = new HashSet<String>();
    char[] chars = htmlContent.toCharArray();
    bulletParser.setCallback(textExtractor);
    bulletParser.parse(chars);
    try {
        text = articleExtractor.getText(htmlContent);
    } catch (BoilerpipeProcessingException e) {
        e.printStackTrace();
    }
    if (text == null) {
        text = textExtractor.text.toString().trim();
    }
    title = textExtractor.title.toString().trim();
    try {
        Metadata md = new Metadata();
        htmlStream = new ByteArrayInputStream(htmlContent.getBytes());
        String parsedText = tika.parseToString(htmlStream, md);
        // very unlikely to happen
        if (text == null) {
            text = parsedText.trim();
        }
        processMetaData(md);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(htmlStream);
    }
    bulletParser.setCallback(linkExtractor);
    bulletParser.parse(chars);
    Iterator<String> it = linkExtractor.urls.iterator();
    String baseURL = linkExtractor.base();
    if (baseURL != null) {
        contextURL = baseURL;
    }
    int urlCount = 0;
    while (it.hasNext()) {
        String href = it.next();
        href = href.trim();
        if (href.length() == 0) {
            continue;
        }
        String hrefWithoutProtocol = href.toLowerCase();
        if (href.startsWith("http://")) {
            hrefWithoutProtocol = href.substring(7);
        }
        if (hrefWithoutProtocol.indexOf("javascript:") < 0
                && hrefWithoutProtocol.indexOf("@") < 0) {
            URL url = URLCanonicalizer.getCanonicalURL(href, contextURL);
            if (url != null) {
                urls.add(url.toExternalForm());
                urlCount++;
                if (urlCount > MAX_OUT_LINKS) {
                    break;
                }
            }
        }
    }
}
Upvotes: 2
Views: 4639
Reputation: 1092
The exception is coming from the FeedParser class, which indicates that the resource you're trying to parse is an RSS or Atom feed, not an HTML document.
Based on the exception, it seems likely that you're dealing with a malformed feed that declares itself to be UTF-8 (with a <?xml version="1.0" encoding="UTF-8"?> prefix) but then contains content in some other, non-UTF-8 encoding. Given the draconian parsing rules of XML, this feed cannot be parsed, so the TikaException you receive is expected.
For more details about the problem, I suggest you point a feed validator at the troublesome URL.
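If you want to confirm the mismatch locally before reaching for a validator, a strict UTF-8 decode of the raw bytes will fail in the same way the XML parser does. The following is only a minimal sketch: the class name Utf8Check and the helper isValidUtf8 are made up for illustration and are not part of Tika or crawler4j, and the sample feed string stands in for whatever bytes you actually fetched.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or crawler4j.
final class Utf8Check {

    // Returns true only if the bytes decode as strict UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed sample: a feed that declares UTF-8 and contains one non-ASCII character.
        String feed = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><title>caf\u00e9</title>";

        // Bytes that really are UTF-8 pass the check.
        System.out.println(isValidUtf8(feed.getBytes(StandardCharsets.UTF_8)));

        // The same text encoded as ISO-8859-1 fails it, even though the
        // declaration still claims UTF-8.
        System.out.println(isValidUtf8(feed.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

If the check fails for the bytes you hand to Tika, the problem is in the data (or in how it was re-encoded on the way in), not in the parser.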
Upvotes: 1
Reputation: 14675
Try changing
htmlStream = new ByteArrayInputStream(htmlContent.getBytes());
to
String utfHtmlContent = new String(htmlContent.getBytes(), "UTF-8");
htmlStream = new ByteArrayInputStream(utfHtmlContent.getBytes());
This may be a hack, and you may not want to use it as your final solution, but if it starts to work after this change you will know that the input was not originally UTF-8.
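A related thing worth checking: getBytes() with no argument encodes with the JVM's default charset, which on many systems (and on Java versions before 18) is not UTF-8. If htmlContent was decoded correctly into a String, naming the charset explicitly when building the stream should give Tika bytes that match a UTF-8 declaration. This is a sketch under that assumption; ExplicitCharsetStream and toUtf8Stream are illustrative names, not existing crawler4j code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of crawler4j.
final class ExplicitCharsetStream {

    // Encodes the string as UTF-8 regardless of the platform default charset.
    static ByteArrayInputStream toUtf8Stream(String content) {
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Assumed sample content with one non-ASCII character.
        String htmlContent = "caf\u00e9";

        ByteArrayInputStream htmlStream = toUtf8Stream(htmlContent);

        // 'c', 'a', 'f' take one byte each and U+00E9 takes two in UTF-8.
        System.out.println(htmlStream.available()); // prints 5
    }
}
```

In the question's parse method this would correspond to replacing htmlContent.getBytes() with htmlContent.getBytes(StandardCharsets.UTF_8). It will not help if the String was already decoded with the wrong charset when the page was fetched.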
Upvotes: 1