Reputation: 1855
Since we cannot read directly from a JSON file, I'm using a .txt one. It looks like this, with more elements separated by ",":
[
{
"Item_Identifier": "FDW58",
"Outlet_Size": "Medium"
},
{
"Item_Identifier": "FDW14",
"Outlet_Size": "Small"
},
]
I want to count the number of elements; here I would get 2. The problem is that I can't separate the text into elements delimited by a comma ','. I get every line on its own, even if I convert it into JSON format.
lines = (
    p
    | 'receive_data' >> beam.io.ReadFromText(known_args.input)
    | 'jsondumps' >> beam.Map(lambda x: json.dumps(x))
    | 'jsonloads' >> beam.Map(lambda x: json.loads(x))
    | 'print' >> beam.ParDo(PrintFn())
)
Upvotes: 1
Views: 347
Reputation: 1310
I don't believe this is a safe approach. I haven't used the Python SDK (I use Java), but io.TextIO on the Java side is pretty clear that it will emit a PCollection where each element is one line of input from the source file. Hierarchical data formats (JSON, XML, etc.) are not amenable to being split apart in this way.
If your file is as well formatted and non-nested as the JSON you've included, you can get away with:
}
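A line-based count of this kind could be sketched in Python roughly as follows. This is only an illustration, not the Java code from this answer: it assumes every object opens on a line of its own starting with `{` and that objects are never nested, and `count_records` and `sample` are hypothetical names.

```python
def count_records(lines):
    """Count the lines that open a JSON object (assumes flat, one-brace-per-line input)."""
    return sum(1 for line in lines if line.strip().startswith("{"))


# The lines a line-oriented reader would hand over for the file in the question.
sample = [
    "[",
    "{",
    '"Item_Identifier": "FDW58",',
    '"Outlet_Size": "Medium"',
    "},",
    "{",
    '"Item_Identifier": "FDW14",',
    '"Outlet_Size": "Small"',
    "},",
    "]",
]

print(count_records(sample))  # prints 2
```

The count breaks as soon as an object contains a nested object or the file is formatted differently, which is why this only works for well-formatted, flat input.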
To integrate with JSON more generally, though, we took a different approach:
Get a ReadableFile instance from a MatchResult and access the file through that.
My understanding has been that not all file formats are good matches for distributed processors. Gzip, for example, cannot be 'split' or chunked easily. The same goes for JSON. CSV has an issue where the values are nonsense unless you also have the opening line handy.
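The whole-file idea can be sketched in plain Python as below: parse the complete document in one go instead of line by line. This is a minimal sketch that assumes the file holds one valid top-level JSON array; `count_elements_in_file` is a hypothetical helper name, and the temporary file only stands in for the real input.

```python
import json
import os
import tempfile


def count_elements_in_file(path):
    """Parse the whole file as one JSON document and count its top-level elements."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    return len(data)


# Write a small stand-in input file, count its elements, then clean up.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('[{"Item_Identifier": "FDW58"}, {"Item_Identifier": "FDW14"}]')

print(count_elements_in_file(tmp.name))  # prints 2
os.remove(tmp.name)
```

In a Beam Python pipeline the same parsing step could presumably be applied once per file, for example to the text returned by `ReadableFile.read_utf8()` after `fileio.MatchFiles` and `fileio.ReadMatches`, rather than to the individual lines produced by `ReadFromText`.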
Upvotes: 1
Reputation: 7443
A .json file is simply a text file whose contents are in JSON-parseable format.
Your JSON is invalid because it has trailing commas. This should work:
import json
j = r"""
[
{
"Item_Identifier": "FDW58",
"Outlet_Size": "Medium"
},
{
"Item_Identifier": "FDW14",
"Outlet_Size": "Small"
}
]
"""
print(json.loads(j))
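To get the count the question asks for, and to see the trailing-comma problem directly, something like the following should do. It is a small sketch using shortened copies of the data; `valid` and `invalid` are illustrative names.

```python
import json

valid = '[{"Item_Identifier": "FDW58"}, {"Item_Identifier": "FDW14"}]'
invalid = '[{"Item_Identifier": "FDW58"}, {"Item_Identifier": "FDW14"},]'

# A parsed JSON array is a Python list, so len() gives the element count.
count = len(json.loads(valid))
print(count)  # prints 2

# The same data with a trailing comma is rejected by the parser.
try:
    json.loads(invalid)
except json.JSONDecodeError as exc:
    print("invalid JSON:", exc)
```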
Upvotes: 1