Splitting .txt file into elements

Question

Since we cannot read directly from a JSON file, I'm using a .txt one. It looks like this, with more elements separated by ",":

[
  {
    "Item_Identifier": "FDW58", 
    "Outlet_Size": "Medium"

  },
  {
    "Item_Identifier": "FDW14",
    "Outlet_Size": "Small"
  },
]

I want to count the number of elements; here I would get 2. The problem is that I can't split the text into elements at the commas. I get each line as a separate element, even if I convert it to JSON.

lines = (
    p
    | 'receive_data' >> beam.io.ReadFromText(known_args.input)
    | 'jsondumps' >> beam.Map(lambda x: json.dumps(x))
    | 'jsonloads' >> beam.Map(lambda x: json.loads(x))
    | 'print' >> beam.ParDo(PrintFn())
)

drobert · Accepted Answer

I don't believe this is a safe approach. I haven't used the Python SDK (I use Java), but io.TextIO on the Java side is pretty clear that it emits a PCollection in which each element is one line of input from the source file. Hierarchical data formats (JSON, XML, etc.) are not amenable to being split apart in this way.
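To make the line-by-line behaviour concrete, here is a small plain-Python sketch that needs no Beam at all. It assumes a cleaned-up copy of the sample from the question (the `sample` string and the `parses_as_json` helper are illustrative names, not part of any SDK). It shows that a `json.dumps` followed by `json.loads` on a single line just hands the same line back, and that no individual line is a JSON document on its own.

```python
import json

# Assumed stand-in for the question's file, one physical line per list entry.
sample = """[
  {
    "Item_Identifier": "FDW58",
    "Outlet_Size": "Medium"
  },
  {
    "Item_Identifier": "FDW14",
    "Outlet_Size": "Small"
  }
]"""

# A line-oriented reader hands the pipeline one element per physical line.
lines = sample.splitlines()

# dumps wraps each line in a JSON string literal; loads unwraps it again,
# so the round trip is a no-op and the elements are still single lines.
round_tripped = [json.loads(json.dumps(line)) for line in lines]


def parses_as_json(text):
    """Return True if `text` is a complete JSON document by itself."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Collect the lines that could be parsed in isolation.
parseable = [line for line in lines if parses_as_json(line)]
```

With this input, `round_tripped` equals `lines` and `parseable` stays empty, which is why the pipeline in the question still sees one element per line.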

If your file is as well formatted and non-nested as the JSON you've included, you can get away with:

reading the file by line (as I believe you're doing)
filtering for only lines containing }
counting the resulting pcollection size
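The three steps above can be sketched in the Python SDK roughly as follows. This is a minimal sketch, not a tested pipeline: `closes_object`, `count_objects_by_line`, and `build_count` are hypothetical helper names, and it relies on the assumption that every object is flat and ends on a line containing `}`. The Beam import is done inside the function so the filtering logic can be tried without Beam installed.

```python
def closes_object(line):
    """True for lines that end an object, assuming flat, pretty-printed JSON."""
    return "}" in line


def count_objects_by_line(lines):
    """Plain-Python version of the same idea, handy for a quick local check."""
    return sum(1 for line in lines if closes_object(line))


def build_count(pipeline, path):
    """Attach the read / filter / count steps to an existing Beam pipeline."""
    import apache_beam as beam  # imported lazily; only needed for the pipeline

    return (
        pipeline
        | "read_lines" >> beam.io.ReadFromText(path)
        | "keep_closing_braces" >> beam.Filter(closes_object)
        | "count" >> beam.combiners.Count.Globally()
    )
```

A quick local check of the filter: `count_objects_by_line(["[", "  {", '    "a": 1', "  },", "  {", '    "b": 2', "  }", "]"])` gives 2. Any nested object, or a `}` inside a string value, would inflate the count, which is why this only suits very regular files.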

To integrate with json more generally, though, we took a different approach:

start with a PCollection of Strings where each value is the path to a file
use native libraries to open each file and parse it in a streaming fashion (we use Scala, which has a few streaming JSON parsing libraries available)
- alternatively, use Beam's APIs to obtain a ReadableFile instance from a MatchResult and access the file through that
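In the Python SDK, the ReadableFile route could look something like the sketch below. It is an illustration under assumptions rather than what we run: it swaps the streaming parser for the standard `json` module, so each file is read fully into memory, and `count_elements` and `count_per_file` are hypothetical names. Python's `json` module is strict, so a trailing comma after the last element would have to be removed from the file first.

```python
import json


def count_elements(text):
    """Parse one whole JSON document and count the items in its top-level array."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    return len(data)


def count_per_file(pipeline, file_pattern):
    """Emit (path, element_count) for every file matching `file_pattern`."""
    import apache_beam as beam  # imported lazily; only needed for the pipeline
    from apache_beam.io import fileio

    return (
        pipeline
        | "match_files" >> fileio.MatchFiles(file_pattern)
        | "open_files" >> fileio.ReadMatches()
        # Each element is a ReadableFile; read_utf8() returns its full contents.
        | "count_elements" >> beam.Map(
            lambda f: (f.metadata.path, count_elements(f.read_utf8()))
        )
    )
```

Here each file is one unit of work, so the parallelism comes from having many files rather than from splitting a single file. For files too large to hold in memory, a streaming parser would be needed in place of `json.loads`.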

My understanding is that not all file formats are good matches for distributed processors. Gzip, for example, cannot be 'split' or chunked easily, and the same goes for JSON. CSV has a related issue: the values in a chunk are meaningless unless you also have the header line handy.
