Michael P.

Reputation: 15

Parsing XML with BufferedReader in Java

To begin with, the XML file is 2.84 GB, and neither the SAX nor the DOM parser seems to work: I've tried both and they crash every time. So I chose to read the file with a BufferedReader and extract the data I want, treating the XML file as plain text.

XML File(small part):

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2019-11-22.dtd">
<dblp>
<phdthesis mdate="2016-05-04" key="phd/dk/Heine2010">
<author>Carmen Heine</author>
<title>Modell zur Produktion von Online-Hilfen.</title>
<year>2010</year>
<school>Aarhus University</school>
<pages>1-315</pages>
<isbn>978-3-86596-263-8</isbn>
<ee>http://d-nb.info/996064095</ee>
</phdthesis><phdthesis mdate="2020-02-12" key="phd/Hoff2002">
<author>Gerd Hoff</author>
<title>Ein Verfahren zur thematisch spezialisierten Suche im Web und seine Realisierung im Prototypen HomePageSearch</title>
<year>2002</year>

From that XML file I want to retrieve the data between the <year> tags. I also used Pattern and Matcher with a regex to find the information I want. My code so far:

public class Publications {
    public static void main(String[] args) throws IOException {
        File file = new File("dblp-2020-04-01.xml");
        FileInputStream fileStream = new FileInputStream(file);
        InputStreamReader input = new InputStreamReader(fileStream);
        BufferedReader reader = new BufferedReader(input);
        String line;
        String regex = "\\d+";


        // Reading line by line from the
        // file until a null is returned
        while ((line = reader.readLine()) != null) {
            final Pattern pattern = Pattern.compile("<year>(.+?)</year>", Pattern.DOTALL);
            final Matcher matcher = pattern.matcher("<year>"+regex+"</year>");
            matcher.find();
            System.out.println(matcher.group(1)); // Prints String I want to extract
            }
        }
}

After compiling and running, the results aren't what I expected. Instead of printing the exact year every time the parser finds the ... tag, the output is the following:

\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+

Any suggestions?

Upvotes: 0

Views: 3239

Answers (2)

collapsar

Reputation: 17238

Remark

Regexen are the wrong tool for extracting information from XML (or similarly structured formats), so the general approach is not recommended. For the right way to handle it, see Michael Kay's answer.

Answer

You pass the wrong argument when constructing the matcher. Your code matches against the literal string "<year>\d+</year>", so group 1 is always \d+. Instead of that expression, you need to pass the current line:

// ...
final Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // Prints String I want to extract
}
// ...

Note the extra conditional that checks whether the current line matches at all.

Also note that the pattern you match against is defined in the call to Pattern.compile. Thus, to match only <year> tags that contain numerical values, that line has to be changed to

final Pattern pattern = Pattern.compile("<year>(" + regex + ")</year>", Pattern.DOTALL);
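Putting both changes together, a minimal sketch of the corrected loop could look like the following. The class and method names (YearLines, findYears) are hypothetical, and the sketch assumes each <year> element sits entirely on one line, as in the sample; it also compiles the pattern once instead of on every iteration. The caveat from the remark above still applies: this is line-based text matching, not XML parsing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class YearLines {
    // Compiled once, outside the read loop; the digits are the capture group.
    private static final Pattern YEAR = Pattern.compile("<year>(\\d+)</year>");

    // Returns every numeric <year> value found, in document order.
    static List<String> findYears(BufferedReader reader) throws IOException {
        List<String> years = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            // Match against the current line, not against the regex text.
            Matcher matcher = YEAR.matcher(line);
            // A single line may hold more than one <year> element.
            while (matcher.find()) {
                years.add(matcher.group(1));
            }
        }
        return years;
    }
}
```

Since the file declares encoding="ISO-8859-1", it is probably safer to construct the InputStreamReader with StandardCharsets.ISO_8859_1 rather than relying on the platform default charset.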

Upvotes: 0

Michael Kay

Reputation: 163468

Please don't try parsing XML using regular expressions. We get hundreds of questions on this forum from people trying to generate XML in peculiar formats because that's the only thing the receiving application can handle, and the reason the receiving application has such restrictions is that it's trying to do the XML parsing "by hand". You're storing up trouble for yourself, for the people you want to exchange data with, and for the people on StackOverflow that you will turn to for help when it all goes pear-shaped. XML standards exist for a reason, and work very well when everyone conforms to them.

The right approach in this case is a streaming XML approach, using SAX, StAX, or streaming XSLT 3.0, and you've abandoned those approaches for completely spurious reasons.
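As an illustration of the streaming approach, here is a minimal StAX sketch that visits each <year> element without loading the document into memory. The class and method names (YearExtractor, forEachYear) are hypothetical. For the real DBLP file, the sketch assumes the referenced DTD is available next to the XML file; because that file relies heavily on entity references, it may also be necessary to raise the JDK's entity expansion limit (the jdk.xml.entityExpansionLimit system property). That is an assumption about why the earlier attempts crashed, not something the question confirms.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of the original answer.
class YearExtractor {
    // Streams through the XML and hands the text of each <year> element
    // to the callback; only the current event is held in memory.
    static void forEachYear(InputStream in, Consumer<String> onYear)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Passing an InputStream lets the parser honor the encoding
        // declared in the XML prolog.
        XMLStreamReader xml = factory.createXMLStreamReader(in);
        try {
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && "year".equals(xml.getLocalName())) {
                    // getElementText() reads the text content and
                    // advances to the matching end tag.
                    onYear.accept(xml.getElementText().trim());
                }
            }
        } finally {
            xml.close();
        }
    }
}
```

A caller would open the file with a FileInputStream and pass something like System.out::println as the callback to print each year as it is encountered.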

Upvotes: 2
