Rim

Reputation: 1855

Splitting .txt file into elements

Since we cannot read directly from a JSON file, I'm using a .txt file instead. It looks like this, with more elements separated by ",":

[
  {
    "Item_Identifier": "FDW58", 
    "Outlet_Size": "Medium"
  },
  {
    "Item_Identifier": "FDW14",
    "Outlet_Size": "Small"
  },
]

I want to count the number of elements; here I would expect 2. The problem is that I can't split the text into the elements separated by commas. I get every line as its own element, even if I convert it to JSON format.

lines = (
    p
    | 'receive_data' >> beam.io.ReadFromText(known_args.input)
    | 'jsondumps' >> beam.Map(lambda x: json.dumps(x))
    | 'jsonloads' >> beam.Map(lambda x: json.loads(x))
    | 'print' >> beam.ParDo(PrintFn())
)

Upvotes: 1

Views: 347

Answers (2)

drobert

Reputation: 1310

I don't believe this is a safe approach. I haven't used the Python SDK (I use Java), but io.TextIO on the Java side is pretty clear that it emits a PCollection where each element is one line of input from the source file. Hierarchical data formats (JSON, XML, etc.) are not amenable to being split apart in this way.
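To see why the pipeline in the question prints each line on its own, here is a small sketch in plain Python. It imitates a line-oriented reader with `str.splitlines`, which is an assumption standing in for the text source and not Beam itself. Each line is just a string, so passing it through `json.dumps` and then `json.loads` hands back the same string rather than a parsed object.

```python
import json

# Sample text mirroring the file in the question (trailing comma removed).
text = """[
  {
    "Item_Identifier": "FDW58",
    "Outlet_Size": "Medium"
  },
  {
    "Item_Identifier": "FDW14",
    "Outlet_Size": "Small"
  }
]"""

# Stand-in for a line-oriented source: one element per line of input.
lines = text.splitlines()

# dumps() encodes each string as a JSON string literal; loads() decodes it back.
round_tripped = [json.loads(json.dumps(line)) for line in lines]

print(len(lines))              # counts lines, not JSON objects
print(round_tripped == lines)  # the round trip leaves every line unchanged
```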

If your file is as well formatted and non-nested as the JSON you've included, you can get away with:

  • reading the file by line (as I believe you're doing)
  • filtering for only lines containing }
  • counting the resulting pcollection size
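The three steps above can be sketched in plain Python as follows. The list `sample_lines` and the helper `closes_object` are hypothetical names used only for illustration. In the Beam Python SDK the same predicate could presumably be handed to `beam.Filter` and followed by a counting combiner such as `beam.combiners.Count.Globally()`, but that wiring is untested here.

```python
# Lines as a line-oriented reader would emit them (data copied from the
# question, including its trailing comma after the last object).
sample_lines = [
    "[",
    "  {",
    '    "Item_Identifier": "FDW58",',
    '    "Outlet_Size": "Medium"',
    "  },",
    "  {",
    '    "Item_Identifier": "FDW14",',
    '    "Outlet_Size": "Small"',
    "  },",
    "]",
]


def closes_object(line: str) -> bool:
    """Return True for lines that contain a closing brace."""
    return "}" in line


# Keep only the lines that close an object, then count them.
element_count = sum(1 for line in sample_lines if closes_object(line))
print(element_count)
```

This only holds while every object is flat and no string value contains a `}`; nested objects or braces inside values would inflate the count.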

To integrate with json more generally, though, we took a different approach:

  • start with a PCollection of Strings where each value is the path to a file
  • use native libraries to access the file and parse it in a streaming fashion (we use Scala, which has a few streaming JSON parsing libraries available)
    • alternatively, use Beam's APIs to obtain a ReadableFile instance from a MatchResult and access the file through that
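A minimal Python sketch of the "one element per file path" idea is shown below. Note the substitution: it parses each file whole with the standard `json` module instead of using a streaming parser, so it only suits files that fit in memory. The function name `count_elements` and the demo file are hypothetical. In a Beam pipeline the list comprehension would be replaced by a map over a PCollection of paths; the Python SDK's `apache_beam.io.fileio` module (`MatchFiles`, `ReadMatches`) looks like the counterpart of the Java route mentioned above, but that is an assumption to verify.

```python
import json
import os
import tempfile


def count_elements(path: str) -> int:
    """Parse one whole JSON file and return the length of its top-level array."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    return len(data)


# Demo: write a small valid JSON file, then process a list of paths,
# one path per element, the way a collection of file names would be mapped.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "items.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump(
            [{"Item_Identifier": "FDW58"}, {"Item_Identifier": "FDW14"}],
            handle,
        )
    counts = [count_elements(path) for path in [demo_path]]

print(counts)
```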

My understanding has been that not all file formats are good matches for distributed processors. Gzip, for example, cannot be 'split' or chunked easily. The same goes for JSON. CSV has an issue where the values in a chunk are meaningless unless you also have the header line handy.

Upvotes: 1

alex

Reputation: 7443

A .json file is simply a text file whose contents are in JSON-parseable format.

Your JSON is invalid because it has a trailing comma after the last element. This should work:

import json

j = r"""
[
    {
        "Item_Identifier": "FDW58",
        "Outlet_Size": "Medium"
    },
    {
        "Item_Identifier": "FDW14",
        "Outlet_Size": "Small"
    }
]
"""

print(json.loads(j))
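To get the count the question asks for, the parsed list can be passed to `len`. The sketch below uses two hypothetical strings, `valid` and `with_trailing_comma`, shortened from the data in the question, and also shows that the standard `json` module rejects the trailing-comma form.

```python
import json

# Shortened versions of the question's data (hypothetical sample strings).
valid = '[{"Item_Identifier": "FDW58"}, {"Item_Identifier": "FDW14"}]'
with_trailing_comma = '[{"Item_Identifier": "FDW58"}, {"Item_Identifier": "FDW14"},]'

# A parsed top-level JSON array is a Python list, so len() gives the count.
count = len(json.loads(valid))
print(count)

# The same data with a trailing comma fails to parse.
try:
    json.loads(with_trailing_comma)
    trailing_comma_rejected = False
except json.JSONDecodeError as error:
    trailing_comma_rejected = True
    print(f"invalid JSON: {error}")
```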

Upvotes: 1
