Parsing an XML file using Apache Tika

Question

I crawl a web page, extract all the links from it, and then try to parse each URL with Apache Tika and BoilerPipe using the code below. Most URLs parse fine, but for a few XML documents I get the following error. I am not sure what this error means. Is it a problem with my code or with the XML file? The error comes from this line, around line 100 in HTMLParser.java:

String parsedText = tika.parseToString(htmlStream, md);

The error I am getting:

org.apache.tika.exception.TikaException: Invalid XML: Error on line 16: Invalid byte 1 of 1-byte UTF-8 sequence.
        at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:75)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
        at org.apache.tika.Tika.parseToString(Tika.java:357)
        at edu.uci.ics.crawler4j.crawler.HTMLParser.parse(HTMLParser.java:101)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.handleHtml(WebCrawler.java:227)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:299)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:118)
        at java.lang.Thread.run(Unknown Source)

HTMLParser.java code:

public void parse(String htmlContent, String contextURL) {

    InputStream htmlStream = null;
    text = null;
    title = null;
    metaData = new HashMap();

    urls = new HashSet();
    char[] chars = htmlContent.toCharArray();

    bulletParser.setCallback(textExtractor);
    bulletParser.parse(chars);

    try {
        text = articleExtractor.getText(htmlContent);
    } catch (BoilerpipeProcessingException e) {
        e.printStackTrace();
    }

    if (text == null){
        text = textExtractor.text.toString().trim(); 
    }

    title = textExtractor.title.toString().trim();
    try {
        Metadata md = new Metadata();
        htmlStream = new ByteArrayInputStream(htmlContent.getBytes());
        String parsedText = tika.parseToString(htmlStream, md);
        //very unlikely to happen
        if (text == null){
            text = parsedText.trim();
        }
        processMetaData(md);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(htmlStream);
    }
    bulletParser.setCallback(linkExtractor);
    bulletParser.parse(chars);
    Iterator<String> it = linkExtractor.urls.iterator();

    String baseURL = linkExtractor.base();
    if (baseURL != null) {
        contextURL = baseURL;
    }

    int urlCount = 0;
    while (it.hasNext()) {
        String href = it.next();
        href = href.trim();
        if (href.length() == 0) {
            continue;
        }
        String hrefWithoutProtocol = href.toLowerCase();
        if (href.startsWith("http://")) {
            hrefWithoutProtocol = href.substring(7);
        }
        if (hrefWithoutProtocol.indexOf("javascript:") < 0
                && hrefWithoutProtocol.indexOf("@") < 0) {
            URL url = URLCanonicalizer.getCanonicalURL(href, contextURL);
            if (url != null) {
                urls.add(url.toExternalForm());
                urlCount++;
                if (urlCount > MAX_OUT_LINKS) {
                    break;
                }   
            }               
        }
    }
}

Paul Croarkin · Accepted Answer

Try changing

htmlStream = new ByteArrayInputStream(htmlContent.getBytes());

to

String utfHtmlContent = new String(htmlContent.getBytes(), "UTF-8");
htmlStream = new ByteArrayInputStream(utfHtmlContent.getBytes());

This may be a hack, and you may not want to use it as your final solution, but if it starts to work after this change, you will know that the input was not originally UTF-8.
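
For a less platform-dependent variant of the same idea, here is a minimal sketch that encodes the string with an explicit charset instead of relying on the JVM default used by the no-argument getBytes(). The class name Utf8StreamSketch and the helpers toUtf8Stream and isValidUtf8 are made up for illustration; they are not part of Tika or crawler4j. It also assumes htmlContent was decoded correctly when the page was fetched. If the crawler already decoded the bytes with the wrong charset, re-encoding will not recover the original characters.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8StreamSketch {

    // Hypothetical helper: encode the page text with an explicit charset so
    // the resulting bytes do not depend on the platform default encoding.
    static InputStream toUtf8Stream(String content) {
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }

    // Hypothetical helper: strict check that returns true only when the
    // bytes form well-formed UTF-8 (malformed input is reported, not replaced).
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A document that declares UTF-8 and contains one non-ASCII character.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><title>caf\u00e9</title>";

        // Encoded as ISO-8859-1, the bytes no longer match the declaration.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        // Encoded as UTF-8, the bytes match the declaration.
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 bytes valid as UTF-8: " + isValidUtf8(latin1));
        System.out.println("UTF-8 bytes valid as UTF-8: " + isValidUtf8(utf8));
    }
}
```

In the question's parse method, the equivalent change would be to build htmlStream from htmlContent.getBytes(StandardCharsets.UTF_8) (or getBytes("UTF-8") on Java versions older than 7) before calling tika.parseToString(htmlStream, md).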
