I am new to working with JSON data on Hive. I am working on a Spark application that receives JSON data and stores it in Hive tables. The JSON consists of nested objects: each top-level field holds further nested objects, each carrying rating and rating_numeric fields.
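A representative fragment of the data (field names taken from the table schema below; the values here are just placeholders) looks roughly like this:

{
  "web_app_security": {
    "rating": "good",
    "rating_numeric": 8.5,
    "config_web_cms_authentication": { "rating": "good", "rating_numeric": 9.0 },
    "web_threat_intel_alert_external": { "rating": "fair", "rating_numeric": 6.2 },
    "web_http_security_headers": { "rating": "good", "rating_numeric": 7.8 }
  },
  "dns_security": { "rating": "good", "rating_numeric": 8.1 }
}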
I am able to read the JSON into a DataFrame and save it to a location on HDFS, but getting Hive to read that data is the tough part.
After searching online for examples, I tried to use the STRUCT type for all the JSON fields and then access the elements using column.element. For example, web_app_security will be the name of a column (of type STRUCT) inside the table, and the nested objects in it, like config_web_cms_authentication and web_threat_intel_alert_external, will also be structs (with rating and rating_numeric as their fields).
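For example, once the table works, I would expect to query the nested fields like this:

SELECT web_app_security.rating,
       web_app_security.config_web_cms_authentication.rating_numeric
FROM jsons;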
I tried creating the table with the JSON SerDe. Here is my table definition:
CREATE EXTERNAL TABLE jsons (
web_app_security struct<config_web_cms_authentication: struct<rating: string, rating_numeric: float>, web_threat_intel_alert_external: struct<rating: string, rating_numeric: float>, web_http_security_headers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
dns_security struct<domain_hijacking_protection: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, dns_hosting_providers: struct<rating:string, rating_numeric: float>>,
email_security struct<rating: string, email_encryption_enabled: struct<rating: string, rating_numeric: float>, rating_numeric: float, email_hosting_providers: struct<rating: string, rating_numeric: float>, email_authentication: struct<rating: string, rating_numeric: float>>,
threat_intell struct<rating: string, threat_intel_alert_internal_3: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_1: struct<rating: string, rating_numeric: float>, rating_numeric: float, threat_intel_alert_internal_12: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_6: struct<rating: string, rating_numeric: float>>,
data_loss struct<data_loss_6: struct<rating: string, rating_numeric: float>, rating: string, data_loss_36plus: struct<rating: string, rating_numeric: float>, rating_numeric: float, data_loss_36: struct<rating: string, rating_numeric: float>, data_loss_12: struct<rating: string, rating_numeric: float>, data_loss_24: struct<rating: string, rating_numeric: float>>,
system_hosting struct<host_hosting_providers: struct<rating: string, rating_numeric: float>, hosting_countries: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
defensibility struct<attack_surface_web_ip: struct<rating: string, rating_numeric: float>, shared_hosting: struct<rating: string, rating_numeric: float>, defensibility_hosting_providers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, attack_surface_web_hostname: struct<rating: string, rating_numeric: float>>,
software_patching struct<patching_web_cms: struct<rating: string, rating_numeric: float>, rating: string, patching_web_server: struct<rating: string, rating_numeric: float>, patching_vuln_open_ssl: struct<rating: string, rating_numeric: float>, patching_app_server: struct<rating: string, rating_numeric: float>, rating_numeric: float>,
governance struct<governance_customer_base: struct<rating: string, rating_numeric: float>, governance_security_certifications: struct<rating: string, rating_numeric: float>, governance_regulatory_requirements: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>
)ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS orc
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis'
I tried to parse the rows with the JSON SerDe. After saving some data to the table, I get the following error when I try to query it:
Error: java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text (state=,code=0)
I am not sure if I am doing this the right way. I am open to other ways of storing the data in the table as well. Any help would be appreciated. Thank you.
That's because you're mixing ORC as the storage format (STORED AS orc) with JSON as the SerDe (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). That overrides ORC's default SerDe (OrcSerde), but not its input (OrcInputFormat) and output (OrcOutputFormat) formats, so Hive reads OrcStruct objects off disk and then hands them to a SerDe that expects Text. That mismatch is exactly the ClassCastException you're seeing.
You either need to use ORC storage without overriding its default SerDe; in that case, make sure that your Spark application writes ORC data, not JSON.
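A minimal sketch of the ORC option (column list abbreviated here; keep the full struct definitions from your original statement):

CREATE EXTERNAL TABLE jsons (
  web_app_security struct<...>,
  -- ... remaining struct columns exactly as in the original definition ...
  governance struct<...>
)
STORED AS ORC
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';

On the Spark side, writing with df.write.orc(...) (or saveAsTable into an ORC table) produces files this table can read.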
Or, if you want the data to be stored as JSON, use the JsonSerDe together with plain text files as the storage (STORED AS TEXTFILE).
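A sketch of the JSON option, again with the column list abbreviated (the full struct definitions stay as they are):

CREATE EXTERNAL TABLE jsons (
  web_app_security struct<...>,
  -- ... remaining struct columns exactly as in the original definition ...
  governance struct<...>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';

Here each line of the files at that location must be a single JSON document, which is what Spark's df.write.json(...) produces.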
The Hive Developer Guide explains how SerDes and storage formats work together: https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe