Chris McAtackney

Reputation: 5232

Parsing Large Text Files in Real-time (Java)

I'm interested in parsing a fairly large text file in Java (1.6.x) and was wondering what approach(es) would be considered best practice?

The file will probably be about 1MB in size, and will consist of thousands of entries along the lines of:

Entry
{
    property1=value1
    property2=value2
    ...
}

etc.

My first instinct is to use regular expressions, but I have no prior experience of using Java in a production environment, and so am unsure how powerful the java.util.regex classes are.

To clarify a bit, my application is going to be a web app (JSP) which parses the file in question and displays the various values it retrieves. There is only ever the one file which gets parsed (it resides in a 3rd party directory on the host).

The app will have a fairly low usage (maybe only a handful of users using it a couple of times a day), but it is vital that when they do use it, the information is retrieved as quickly as possible.

Also, are there any precautions to take around loading the file into memory every time it is parsed?

Can anyone recommend an approach to take here?

Thanks

Upvotes: 7

Views: 10617

Answers (9)

Chii

Reputation: 14738

The other solution is to do some form of preprocessing (done offline, or as a cron job) that produces a very optimized data structure, which is then used to serve the many web requests (without having to reparse the file).

Though, looking at the scenario in question, that doesn't seem to be needed.
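One way to sketch that preprocessing idea (my own illustration, not code from the question): parse the file once offline, serialize the resulting map to disk, and have the web app just deserialize it on startup. The class and method names here are assumptions.

```java
// Sketch: parse offline once, serialize the result, deserialize at serve time.
import java.io.*;
import java.util.*;

public class Preprocess {
    // Write the already-parsed data structure to disk.
    public static void save(HashMap<String, String> data, File out) throws IOException {
        ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(out));
        try { oos.writeObject(data); } finally { oos.close(); }
    }

    // Load the preprocessed structure without touching the original text file.
    @SuppressWarnings("unchecked")
    public static HashMap<String, String> load(File in) throws IOException, ClassNotFoundException {
        ObjectInputStream ois = new ObjectInputStream(new FileInputStream(in));
        try { return (HashMap<String, String>) ois.readObject(); } finally { ois.close(); }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> data = new HashMap<String, String>();
        data.put("property1", "value1");
        File tmp = File.createTempFile("entries", ".ser");
        save(data, tmp);
        System.out.println(load(tmp));
    }
}
```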

Upvotes: 1

Neil Coffey

Reputation: 21795

If it's going to be about 1MB and literally in the format you state, then it sounds like you're overengineering things.

Unless your server is a ZX Spectrum or something, just use regular expressions to parse it, whack the data in a hash map (and keep it there), and don't worry about it. It'll take up a few megabytes in memory, but so what...?

Update: just to give you a concrete idea of performance, some measurements I took of String.split() (which uses regular expressions) show that on a 2GHz machine, it takes milliseconds to split 10,000 100-character strings (in other words, about 1 megabyte of data -- actually nearer 2MB in pure volume of bytes, since Strings are 2 bytes per char). Obviously, that's not quite the operation you're performing, but you get my point: things aren't that bad...
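For what it's worth, here's a rough sketch of the regex-plus-map approach for the format in the question (the class and patterns are my own illustration, and assume values contain no whitespace or braces):

```java
// Sketch: parse "Entry { key=value ... }" blocks with java.util.regex.
import java.util.*;
import java.util.regex.*;

public class EntryParser {
    // Matches one "Entry { ... }" block; DOTALL lets '.' span newlines.
    private static final Pattern ENTRY =
        Pattern.compile("Entry\\s*\\{(.*?)\\}", Pattern.DOTALL);
    // Matches one "property=value" pair inside a block.
    private static final Pattern PROPERTY =
        Pattern.compile("(\\w+)=(\\S+)");

    public static List<Map<String, String>> parseEntries(String text) {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        Matcher entryMatcher = ENTRY.matcher(text);
        while (entryMatcher.find()) {
            Map<String, String> props = new LinkedHashMap<String, String>();
            Matcher propMatcher = PROPERTY.matcher(entryMatcher.group(1));
            while (propMatcher.find()) {
                props.put(propMatcher.group(1), propMatcher.group(2));
            }
            entries.add(props);
        }
        return entries;
    }

    public static void main(String[] args) {
        String sample = "Entry\n{\n    property1=value1\n    property2=value2\n}\n";
        System.out.println(parseEntries(sample));
    }
}
```

Parse once, keep the resulting list (or a map keyed on whatever identifies an entry) in memory, and each web request is just a lookup.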

Upvotes: 8

Alan Moore

Reputation: 75232

If it's the limitations of Java regexes you're wondering about, don't worry about it. Assuming you're reasonably competent at crafting regexes, performance shouldn't be a problem. The feature set is satisfyingly rich, too--including my favorite, possessive quantifiers.
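To illustrate what a possessive quantifier does (this example is mine, not from the question): `.*+` matches like `.*` but refuses to give characters back through backtracking, which can make failure fast but also changes what matches.

```java
// Greedy vs. possessive quantifiers in java.util.regex.
import java.util.regex.Pattern;

public class PossessiveDemo {
    public static void main(String[] args) {
        String input = "\"quoted\"";
        // Greedy: ".*" backtracks so the final quote can still match. -> true
        System.out.println(Pattern.matches("\".*\"", input));
        // Possessive: ".*+" swallows the closing quote and never backtracks. -> false
        System.out.println(Pattern.matches("\".*+\"", input));
    }
}
```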

Upvotes: 1

Yuval F

Reputation: 20621

This seems like a simple enough file format, so you may consider using a Recursive Descent Parser. Compared to JavaCC and Antlr, its pros are that you can write a few simple methods, get the data you need, and you do not need to learn a parser generator formalism. Its cons - it may be less efficient. A recursive descent parser is in principle stronger than regular expressions. If you can come up with a grammar for this file type, it will serve you for whatever solution you choose.
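To show how little code "a few simple methods" can mean for this grammar, here is a minimal recursive-descent sketch (entirely my own illustration; names and error handling are assumptions). Each method corresponds to one grammar rule: file := entry*, entry := "Entry" "{" property* "}".

```java
// Minimal recursive-descent parser for the entry format in the question.
import java.util.*;

public class RdParser {
    private final List<String> lines;
    private int pos = 0;

    public RdParser(String text) {
        this.lines = Arrays.asList(text.split("\\r?\\n"));
    }

    private String peek() {
        return pos < lines.size() ? lines.get(pos).trim() : null;
    }

    // file := entry*
    public List<Map<String, String>> parseFile() {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        while (peek() != null) {
            if (peek().equals("Entry")) {
                entries.add(parseEntry());
            } else {
                pos++; // skip blank or unrecognised lines
            }
        }
        return entries;
    }

    // entry := "Entry" "{" property* "}"
    private Map<String, String> parseEntry() {
        pos++; // consume "Entry"
        if (!"{".equals(peek())) {
            throw new IllegalStateException("expected '{' at line " + (pos + 1));
        }
        pos++; // consume "{"
        Map<String, String> props = new LinkedHashMap<String, String>();
        while (peek() != null && !peek().equals("}")) {
            String[] kv = peek().split("=", 2); // property := name "=" value
            if (kv.length == 2) props.put(kv[0].trim(), kv[1].trim());
            pos++;
        }
        pos++; // consume "}"
        return props;
    }

    public static void main(String[] args) {
        String sample = "Entry\n{\n  property1=value1\n  property2=value2\n}\n";
        System.out.println(new RdParser(sample).parseFile());
    }
}
```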

Upvotes: 1

pgras

Reputation: 12770

Not answering the question about parsing ... but you could parse the files and generate static pages as soon as new files arrive. So you would have no performance problems... (And I think 1Mb isn't a big file so you can load it in memory, as long as you don't load too many files concurrently...)

Upvotes: 1

paweloque

Reputation: 18864

You can use the Antlr parser generator to build a parser capable of parsing your files.

Upvotes: 2

mP.

Reputation: 18266

Use the Scanner class and process your file a line at a time. I'm not sure why you mentioned regex. Regex is almost never the right answer to any parsing question because of the ambiguity and lack of semantic control over what's happening in what context.
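A line-at-a-time version of this might look like the following (my own sketch of the idea, assuming the format shown in the question; names are mine):

```java
// Sketch: parse the entry file line by line with java.util.Scanner.
import java.util.*;

public class ScannerParser {
    public static List<Map<String, String>> parseWithScanner(String text) {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        Map<String, String> current = null;
        Scanner scanner = new Scanner(text);
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine().trim();
            if (line.equals("{")) {
                current = new LinkedHashMap<String, String>(); // start of an entry body
            } else if (line.equals("}")) {
                if (current != null) entries.add(current);     // end of an entry body
                current = null;
            } else if (current != null && line.contains("=")) {
                String[] kv = line.split("=", 2);              // property=value
                current.put(kv[0].trim(), kv[1].trim());
            }
        }
        scanner.close();
        return entries;
    }

    public static void main(String[] args) {
        String sample = "Entry\n{\n  property1=value1\n}\n";
        System.out.println(parseWithScanner(sample));
    }
}
```

For a real file you would construct the Scanner from a File or InputStream instead of a String, so the whole file never needs to sit in memory as one String.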

Upvotes: 3

Brian Agnew

Reputation: 272297

I'm wondering why this isn't in XML, and then you could leverage off the available XML tooling. I'm thinking particularly of SAX, in which case you could easily parse/process this without holding it all in memory.

So can you convert this to XML?

If you can't, and you need a parser, then take a look at JavaCC.

Upvotes: 4

Lucero

Reputation: 60190

If it is a proper grammar, use a parser builder such as the GOLD Parsing System. This allows you to specify the format and use an efficient parser to get the tokens you need, getting error-handling almost for free.

Upvotes: 5
