Reputation: 1868
I am writing Apache Beam code where I have to read a JSON file that is placed in the project folder, then read its data and stream it.
This is the sample code to read JSON. Is this the correct way of doing it?
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(SparkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("ReadMyFile", TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));
System.out.println("lines: " + lines);
or I should use,
p.apply(FileIO.match().filepattern("/Users/xyz/eclipse-workspace/beam-prototype/test.json"))
I just need to read the JSON file below, read the complete testdata from it, and then stream it.
{
  "testdata": {
    "siteOwner": "xxx",
    "siteInfo": {
      "siteID": "id_member",
      "siteplatform": "web",
      "siteType": "soap",
      "siteURL": "www"
    }
  }
}
The above code is not reading the JSON file; it prints
lines: ReadMyFile/Read.out [PCollection]
Could you please guide me with a sample reference?
Upvotes: 1
Views: 2024
Reputation: 2621
This is the sample code to read JSON. Is this the correct way of doing it?
To quickly answer your question: yes. Your sample code is the correct way to read a file containing JSON, where each line of the file contains a single JSON element. The TextIO input transform reads a file line by line, so if a single JSON element spans multiple lines, it will not be parseable.
The second code sample can have the same effect, but note that FileIO.match() on its own only produces file metadata, not file contents; to get the lines, follow it with FileIO.readMatches() and TextIO.readFiles(), as sketched below.
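A minimal sketch of that FileIO variant, assuming Beam 2.2 or later where FileIO.readMatches() and TextIO.readFiles() are available, and reusing the pipeline p and file path from your question:

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> lines = p
    // Match the file pattern; this yields MatchResult.Metadata, not file contents.
    .apply(FileIO.match().filepattern("/Users/xyz/eclipse-workspace/beam-prototype/test.json"))
    // Convert each metadata entry into a ReadableFile handle.
    .apply(FileIO.readMatches())
    // Read each matched file line by line, just like TextIO.read().
    .apply(TextIO.readFiles());

The extra steps are what make FileIO useful when you need per-file control (for example, filtering matched files) before reading them.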
The above code is not reading the JSON file; it prints
The printed result is expected. The variable lines does not actually contain the JSON strings in the file. lines is a PCollection of Strings; it simply represents the state of the pipeline after a transform is applied. Accessing elements in the pipeline can be done by applying subsequent transforms; the actual JSON string can be accessed in the implementation of a transform, as shown below.
Upvotes: 1