Reputation: 3158
Hi, I am looking for a way to load a large number of JSON documents, one per line.
Each line is of the format:
{"id": "id123", "c1": "v1", "c2": "v2", "c3": "v3", ...}
Each JSON document can have an unknown number of fields. Is there a way to do this in Pig? I want to load the fields into separate columns in HBase.
Upvotes: 0
Views: 312
Reputation: 419
You probably want to use a UDF. For instance, with Python (via Jython) and the example you provided:
UDF:
from com.xhaus.jyson import JysonCodec as json
from com.xhaus.jyson import JSONDecodeError

# Tell Pig the schema of the returned tuple; extend the field list to match your data.
@outputSchema("t:(id:chararray, c1:chararray, c2:chararray, c3:chararray)")
def parse_json(line):
    try:
        parsed_json = json.loads(line)
    except JSONDecodeError:
        return None
    # Pull values out by key so they stay aligned with the declared schema
    # (dict ordering is not guaranteed).
    return tuple(parsed_json.get(k) for k in ('id', 'c1', 'c2', 'c3'))
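Since you said the number of fields is unknown, one alternative (an untested sketch, using the same Jyson imports as above) is to return the parsed dict directly: Pig's Jython bridge converts a Python dict into a Pig map, so the field set can vary per line. parse_json_map is a name made up for this sketch:

@outputSchema("m:map[chararray]")
def parse_json_map(line):
    # Returning the dict yields a Pig map keyed by the JSON field names.
    try:
        return json.loads(line)
    except JSONDecodeError:
        return None

You would then dereference individual values in Pig with the map operator, e.g. doc#'id', doc#'c1'.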
Pig:
REGISTER 'path-to-udf.py' USING jython AS py_udf;

raw_data = LOAD 'path-to-your-data'
    USING PigStorage('\n')
    AS (line:chararray);

-- Parse each line with the UDF
parsed_data = FOREACH raw_data GENERATE FLATTEN(py_udf.parse_json(line));
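To get the parsed fields into HBase, as the question asks, you could follow this with Pig's built-in HBaseStorage. A rough sketch, assuming a table named my_table with a column family cf (both hypothetical names); HBaseStorage treats the first field of each tuple as the row key and maps the remaining fields to the listed columns in order:

-- First tuple field (id) becomes the HBase row key;
-- c1, c2, c3 go into the listed columns.
STORE parsed_data INTO 'hbase://my_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:c1 cf:c2 cf:c3');

If you use the map-returning UDF sketched above instead, HBaseStorage also accepts a whole-column-family spec ('cf:*') that takes a Pig map, which sidesteps having to list the columns up front.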
Upvotes: 0