Reputation: 313
I have many large YAML files to parse. They each consist of two documents (separated with ---), the second of which is a long list of words. Some words have various attributes. For example:
[...]
---
- a:
- aardvark:
    pos: n
    def: A nocturnal, burrowing mammal that eats ants.
- aardvarks:
    pos: n
    flags:
      - INFL
- aback:
    pos: adv
    flags:
      - PHR_SPEC
- abacus:
- abacuses:
    flags:
      - INFL
[...]
I am aware of multiple YAML parsers for Java: YamlBeans, SnakeYAML Engine, Jackson, and eo-yaml. However, it seems that all of these parsers produce some structure of Map, List, and Set objects (or perhaps custom objects I provide). Because there are many large files, I don't want to load them all into memory like this.
How can I get some Stream or Iterator that I can read to process the files word by word? Ideally I'd get a separate Map<String, Object> for each word, or something similar.
I'm aware that YAML files can contain documents separated by ---. However, I'd rather not separate each word in this way, and I'm already dividing the files into two documents (the first of which has some metadata).
Also, if this is possible with multiple libraries, does the nature of these files lend itself better to one over the others? Note: I'd prefer YAML 1.2.
Upvotes: 1
Views: 149
Reputation: 39678
SnakeYAML Engine does have a low-level API for processing the YAML input as a stream of events. The following is taken from its test code and reads the YAML input 444333:
LoadSettings settings = LoadSettings.builder().build();
StreamReader reader = new StreamReader(settings, "444333");
ScannerImpl scanner = new ScannerImpl(settings, reader);
Parser parser = new ParserImpl(settings, scanner);
assertTrue(parser.hasNext());
assertEquals(Event.ID.StreamStart, parser.next().getEventId());
assertTrue(parser.hasNext());
assertEquals(Event.ID.DocumentStart, parser.next().getEventId());
assertTrue(parser.hasNext());
assertEquals(Event.ID.Scalar, parser.next().getEventId());
assertTrue(parser.hasNext());
assertEquals(Event.ID.DocumentEnd, parser.next().getEventId());
assertTrue(parser.hasNext());
assertEquals(Event.ID.StreamEnd, parser.next().getEventId());
assertFalse(parser.hasNext());
Parser implements Iterator<Event> and thus fits your use case.
By the way, Jackson uses this low-level API of SnakeYAML under the hood.
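To get from a flat event stream to word-by-word processing, one approach is to track nesting depth and cut the stream every time you return to the level of the outer list. Below is a minimal, library-independent sketch of that idea. Note the assumptions: `Kind`, `Ev`, `items` and `sample` are hypothetical names made up for this illustration, not SnakeYAML Engine classes. With the real library, the Parser above would presumably be the event source, and its Event.ID values (SequenceStart, MappingStart, Scalar, and so on) would take the place of `Kind`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class WordEventGrouper {
    // Hypothetical stand-ins for a parser's event types (not SnakeYAML Engine classes).
    enum Kind { SEQ_START, SEQ_END, MAP_START, MAP_END, SCALAR }

    static final class Ev {
        final Kind kind;
        final String value; // only set for SCALAR events
        Ev(Kind kind, String value) { this.kind = kind; this.value = value; }
    }

    /** Groups a flat event stream into one event list per item of the outermost sequence. */
    static Iterator<List<Ev>> items(final Iterator<Ev> events) {
        return new Iterator<List<Ev>>() {
            private int depth = 0;                 // current nesting level
            private List<Ev> pending = advance();  // next item, or null at end of input

            private List<Ev> advance() {
                List<Ev> item = new ArrayList<>();
                while (events.hasNext()) {
                    Ev e = events.next();
                    switch (e.kind) {
                        case SEQ_START:
                        case MAP_START:
                            if (depth >= 1) item.add(e); // skip the outermost sequence start
                            depth++;
                            break;
                        case SEQ_END:
                        case MAP_END:
                            depth--;
                            if (depth >= 1) item.add(e); // skip the outermost sequence end
                            break;
                        default:
                            if (depth >= 1) item.add(e);
                    }
                    // Back at depth 1 with events collected: one list item is complete.
                    if (depth == 1 && !item.isEmpty()) return item;
                }
                return null;
            }

            @Override public boolean hasNext() { return pending != null; }

            @Override public List<Ev> next() {
                if (pending == null) throw new NoSuchElementException();
                List<Ev> result = pending;
                pending = advance();
                return result;
            }
        };
    }

    /** Events a parser might emit for the two items "- a:" and "- aardvark: {pos: n}". */
    static List<Ev> sample() {
        return Arrays.asList(
            new Ev(Kind.SEQ_START, null),
            new Ev(Kind.MAP_START, null), new Ev(Kind.SCALAR, "a"),
            new Ev(Kind.SCALAR, ""), new Ev(Kind.MAP_END, null),
            new Ev(Kind.MAP_START, null), new Ev(Kind.SCALAR, "aardvark"),
            new Ev(Kind.MAP_START, null), new Ev(Kind.SCALAR, "pos"),
            new Ev(Kind.SCALAR, "n"), new Ev(Kind.MAP_END, null),
            new Ev(Kind.MAP_END, null),
            new Ev(Kind.SEQ_END, null));
    }

    public static void main(String[] args) {
        Iterator<List<Ev>> it = items(sample().iterator());
        while (it.hasNext()) {
            List<Ev> word = it.next();
            // In this sample every item is a one-key mapping, so event 1 is the word.
            System.out.println(word.get(1).value + ": " + word.size() + " events");
        }
    }
}
```

Only the events of the current word are held in memory at any time. Each per-word event list could then be turned into a Map<String, Object> if that is more convenient. Stream and document start/end events are left out of the sketch, so for your two-document files you would first skip ahead to the second document.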
Upvotes: 2