ScottFree
ScottFree

Reputation: 632

Data issue with Hive and JsonSerDe

I'm using JSON formatted twitter data and created my Hive structure to store the data. I am also using the SerDe org.openx.data.jsonserde.JsonSerDe to serialize/deserialize each row.

I have one problem column which is the geo column (the most important column for my application as it happens). The structure for this column is as follows (complete structure at the bottom):

geo struct<coordinates:array<double>, type:string>

This works fine for rows with the correct data: "geo":{"type":"Point","coordinates":[0.123337,0.955139]}

however a large percentage of my data contains the following for the geo column: "geo":"None"

which causes the following SerDe error: Data is not JSONObject but java.lang.String with value None

Formatting my data is not really an option as there is nearly 1 terabyte of raw text files so I'd like to avoid this if possible!

Perhaps I need to write my own SerDe to cater for this but does anyone know if I can do a cast or something similar for non populated geo data?

Full table structure:

CREATE TABLE tweets (
    coordinates struct<coordinates:array<double>, type:string>,
    created_at string,
    favorite_count int,
    geo struct<coordinates:array<double>, type:string>,
    lang string,
    place struct<attributes:struct<street_address:string>, bounding_box:struct<coordinates:array<array<array<double>>>, type:string>, country:string, country_code:string, full_name:string, id:string, name:string, place_type:string, url:string>,
    retweet_count int,
    source string,
    text string,
    timestamp_ms timestamp)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties ("ignore.malformed.json" = "true")
location '/user/hue/tweets/';

Upvotes: 1

Views: 1968

Answers (1)

Ramzy
Ramzy

Reputation: 7138

The issue seems to be not the non populated data. Because, the correct formatted json with no data would be like this {"type":null,"coordinates":null}, which should be picked up by serde, and parse accordingly.

And the "ignore.malformed.json", will only take care care of incorrect formatting at json level, and therby skipping the record as per documentation.

So given your situation, the source of json creation is the actual culprit sending out an altogether different format when there is no data. I assume you wont be able to change it(but if you can, it would be great). So the final solution seems to be writing your own serde, as there dont seem to be anyproperty to handle this scenarion, where a new datatype is coming(String instead of defined json).

Its a good usecase, resulting due to lack of understanding of different layers. Even i wait for any answer other than your own serde,on this aspect.

Upvotes: 2

Related Questions