Hemanth Annavarapu

Reputation: 917

java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text. Error with json serde

I am new to working with JSON data in Hive. I am working on a Spark application that receives JSON data and stores it in Hive tables. I have a JSON like this:

[screenshot: a JSON of JSONs (nested JSON objects)]

which looks like this when expanded:

[screenshot: the expanded hierarchy of the nested JSON]

I am able to read the JSON into a DataFrame and save it to a location on HDFS, but getting Hive to read that data is the tough part.

After searching online for examples, I tried this approach:

use a STRUCT for each of the JSON fields and then access the elements using column.element.

For example:

web_app_security will be the name of a column (of type STRUCT) inside the table, and the nested JSONs within it, such as config_web_cms_authentication and web_threat_intel_alert_external, will also be STRUCTs (with rating and rating_numeric as their fields).
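
For instance, once the table is in place I expect to query the nested fields with something like this (a sketch based on the column names used in the table definition below):

-- sketch: accessing nested STRUCT fields via column.element
SELECT
  web_app_security.rating,
  web_app_security.config_web_cms_authentication.rating_numeric
FROM jsons
LIMIT 10;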

I tried creating the table with the JSON SerDe. Here is my table definition:

CREATE EXTERNAL TABLE jsons (
web_app_security struct<config_web_cms_authentication: struct<rating: string, rating_numeric: float>, web_threat_intel_alert_external: struct<rating: string, rating_numeric: float>, web_http_security_headers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
dns_security struct<domain_hijacking_protection: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, dns_hosting_providers: struct<rating:string, rating_numeric: float>>,
email_security struct<rating: string, email_encryption_enabled: struct<rating: string, rating_numeric: float>, rating_numeric: float, email_hosting_providers: struct<rating: string, rating_numeric: float>, email_authentication: struct<rating: string, rating_numeric: float>>,
threat_intell struct<rating: string, threat_intel_alert_internal_3: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_1: struct<rating: string, rating_numeric: float>, rating_numeric: float,  threat_intel_alert_internal_12: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_6: struct<rating: string, rating_numeric: float>>,
data_loss struct<data_loss_6: struct<rating: string, rating_numeric: float>, rating: string, data_loss_36plus: struct<rating: string, rating_numeric: float>, rating_numeric: float,  data_loss_36: struct<rating: string, rating_numeric: float>, data_loss_12: struct<rating: string, rating_numeric: float>, data_loss_24: struct<rating: string, rating_numeric: float>>,
system_hosting struct<host_hosting_providers: struct<rating: string, rating_numeric: float>,  hosting_countries: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
defensibility struct<attack_surface_web_ip: struct<rating: string, rating_numeric: float>, shared_hosting: struct<rating: string, rating_numeric: float>, defensibility_hosting_providers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, attack_surface_web_hostname: struct<rating: string, rating_numeric: float>>,
software_patching struct<patching_web_cms: struct<rating: string, rating_numeric: float>, rating: string, patching_web_server: struct<rating: string, rating_numeric: float>, patching_vuln_open_ssl: struct<rating: string, rating_numeric: float>, patching_app_server: struct<rating: string, rating_numeric: float>, rating_numeric: float>,
governance struct<governance_customer_base: struct<rating: string, rating_numeric: float>, governance_security_certifications: struct<rating: string, rating_numeric: float>, governance_regulatory_requirements: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS orc
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis'

I tried to parse the rows with the JSON SerDe. After saving some data to the table, I get the following error when I try to query it:

Error: java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text (state=,code=0)

I am not sure if I am doing it the right way.

I am open to any other ways of storing the data in the table as well. Any help would be appreciated. Thank you.

Upvotes: 0

Views: 3189

Answers (1)

Sergey Khudyakov

Reputation: 1182

That's because you're mixing ORC as the storage format (STORED AS orc) with JSON as the SerDe (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'): you override ORC's default SerDe (OrcSerde), but not its input format (OrcInputFormat) or output format (OrcOutputFormat).
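
In other words, that table definition is roughly equivalent to spelling the formats out explicitly, which makes the mismatch visible (a sketch, trimmed to a single abbreviated column):

-- sketch: what STORED AS orc plus an overriding ROW FORMAT SERDE resolves to
CREATE EXTERNAL TABLE jsons (
  -- abbreviated: one simplified column; the rest as in the original DDL
  web_app_security struct<rating: string, rating_numeric: float>
)
-- the JSON SerDe you specified...
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
-- ...but the reader and writer stay ORC
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';

At read time the ORC input format hands OrcStruct objects to the JSON SerDe, which expects Text, hence the ClassCastException.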

You either need to use ORC storage without overriding its default SerDe; in that case, make sure that your Spark application writes ORC data, not JSON.
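
A minimal sketch of that variant (column list abbreviated, same location as in the question; on the Spark side, something like df.write.orc(...) would produce matching files):

-- sketch: plain ORC table, default OrcSerde
CREATE EXTERNAL TABLE jsons (
  -- abbreviated: one simplified column; the rest as in the original DDL
  web_app_security struct<rating: string, rating_numeric: float>
)
STORED AS orc
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';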

Or, if you want the data to be stored as JSON, use JsonSerDe together with plain text files as the storage (STORED AS TEXTFILE).
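
A sketch of the JSON variant (again abbreviated; this assumes the files at that location are plain text with one JSON object per line):

-- sketch: JSON text files read with JsonSerDe
CREATE EXTERNAL TABLE jsons (
  -- abbreviated: one simplified column; the rest as in the original DDL
  web_app_security struct<rating: string, rating_numeric: float>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';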


The Hive Developer Guide explains how SerDes and storage formats work: https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe

Upvotes: 1
