Kamal

Reputation: 3158

Bulk loading JSON to HBase using Pig

Hi, I am looking for a way to load a large number of JSON documents, one per line.

Each line is of the format:

'{id :"id123", "c1":"v1", "c2":"v2", "c3" :"v3"...}'

Each JSON document can have an unknown number of fields. Is there a way to do this in Pig? I want to load the fields into separate columns in HBase.

Upvotes: 0

Views: 312

Answers (1)

kevad

Reputation: 419

You probably want to use a UDF. For instance, with Python and the sample line you provided:

UDF:

from com.xhaus.jyson import JysonCodec as json
from com.xhaus.jyson import JSONDecodeError


# Extend the schema (and the key list below) with any additional fields.
@outputSchema("rels:{t:(id:chararray, c1:chararray, c2:chararray, c3:chararray)}")
def parse_json(line):
    try:
        parsed_json = json.loads(line)
    except JSONDecodeError:
        return None
    # dict ordering is not guaranteed, so emit values in schema order
    return tuple(parsed_json.get(k) for k in ('id', 'c1', 'c2', 'c3'))
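Since the question mentions an unknown number of fields, a fixed tuple schema only covers the fields you declare up front. A minimal alternative sketch (the name parse_json_map is mine, not from the original answer) returns the whole document as a Pig map, which can carry arbitrary keys:

@outputSchema("json:map[chararray]")
def parse_json_map(line):
    # A Jython dict is converted to a Pig map, so the field set
    # does not need to be known in advance.
    try:
        return json.loads(line)
    except JSONDecodeError:
        return None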

Pig:

REGISTER 'path-to-udf.py' USING jython AS py_udf ;

raw_data = LOAD 'path-to-your-data'
           USING PigStorage('\n')
           AS (line:chararray) ;

-- Parse lines using the UDF (the alias must match the REGISTER statement)
parsed_data = FOREACH raw_data GENERATE FLATTEN(py_udf.parse_json(line)) ;
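The answer stops before the HBase step. As a sketch of the final store (assuming a table named mytable with a column family cf, and that the first field of each tuple, here id, serves as the row key), HBaseStorage can write the remaining fields to separate columns:

STORE parsed_data INTO 'hbase://mytable'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:c1 cf:c2 cf:c3') ;

If you use the map-returning UDF variant instead, HBaseStorage also accepts the wildcard form cf:* for a map field, which writes each key in the map as its own column qualifier; that matches the "unknown number of fields" requirement in the question.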

Upvotes: 0
