Someguywhocodes

Reputation: 781

HQL throws ArrayList cannot be cast to org.apache.hadoop.io.Text

I have a query which fails during the reduce phase; the error thrown is:

Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)

However, digging deeper into the YARN logs, I was able to find this:

Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"2020-05-05","reducesinkkey1":10039,"reducesinkkey2":103,"reducesinkkey3":"2020-05-05","reducesinkkey4":10039,"reducesinkkey5":103},"value":{"_col0":103,"_col1":["1","2"]}}
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:265)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"2020-05-05","reducesinkkey1":10039,"reducesinkkey2":103,"reducesinkkey3":"2020-05-05","reducesinkkey4":10039,"reducesinkkey5":103},"value":{"_col0":103,"_col1":["1","2"]}}
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:253)
    ... 7 more
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hadoop.io.Text

The most relevant part being:

java.util.ArrayList cannot be cast to org.apache.hadoop.io.Text

This is the query I'm executing (FYI: it is a subquery within a bigger query):

SELECT
    yyyy_mm_dd,
    h_id,
    MAX(CASE WHEN rn=1 THEN prov_id ELSE NULL END) OVER (PARTITION BY yyyy_mm_dd, h_id) as primary_prov,
    collect_set(api) OVER (PARTITION BY yyyy_mm_dd, h_id, prov_id) prov_id_api, --re-assemble array to include all elements from multiple initial arrays if there are different arrays per prov_id
    prov_id
FROM(
    SELECT --get "primary prov" (first element in ascending array)
        yyyy_mm_dd,
        h_id,
        prov_id,
        api,
        ROW_NUMBER() OVER(PARTITION BY yyyy_mm_dd, h_id ORDER BY api) rn
    FROM(
        SELECT --explode array to get data at row level
            t.yyyy_mm_dd,
            t.h_id,
            prov_id,
            collect_set(--array of integers, use set to remove duplicates
                CASE
                    WHEN e.apis_xml_element = 'res' THEN 1
                    WHEN e.apis_xml_element = 'av'  THEN 2
                    ...
                    ...
                    ELSE e.apis_xml_element
                END) as api
        FROM
            mytable t
            LATERAL VIEW EXPLODE(apis_xml) e AS apis_xml_element
        WHERE
            yyyy_mm_dd = "2020-05-05"
            AND t.apis_xml IS NOT NULL
        GROUP BY
            1,2,3
        )s
    )s

I have further narrowed the issue down to the top-level SELECT, as the inner SELECT works fine on its own, which makes me believe the problem is in this expression specifically:

collect_set(api) OVER (PARTITION BY yyyy_mm_dd, h_id, prov_id) prov_id_api

However, I'm unsure how to solve it. In the innermost SELECT, apis_xml is an array<string> which holds strings such as 'res' and 'av' up to a certain point in time; after that, integers are used. Hence the CASE statement to align them.
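
For what it's worth, the failing row in the trace shows _col1 as ["1","2"], i.e. the api column the outer collect_set receives is itself an array, because the inner collect_set has already aggregated it. A minimal sketch of that shape, using a hypothetical table t(id, x) that is not from the actual schema:

-- Hypothetical table t(id INT, x STRING), for illustration only.
-- Inner level: collect_set(x) yields one array<string> per id.
SELECT id, collect_set(x) AS api   -- api is array<string>
FROM t
GROUP BY id;

-- Outer level: a windowed collect_set over that column is fed
-- array<string> values rather than scalars, matching the ["1","2"]
-- value in the reducer's failing row above.
SELECT collect_set(api) OVER (PARTITION BY id) AS api_sets
FROM (SELECT id, collect_set(x) AS api FROM t GROUP BY id) s;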

Strangely, if I run this via Spark, i.e. spark.sql(above_query), it works. However, on beeline via HQL, the job gets killed.

Upvotes: 0

Views: 1221

Answers (1)

leftjoin

Reputation: 38290

Remove collect_set in the inner query, because it already produces an array; the upper collect_set should receive scalars. Also remove the GROUP BY in the inner query, because without collect_set there is no aggregation any more. You can use DISTINCT if you need to remove duplicates.
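
A sketch of that rewrite, keeping the table and column names from the question (the elided CASE branches are left as in the original):

SELECT
    yyyy_mm_dd,
    h_id,
    MAX(CASE WHEN rn=1 THEN prov_id ELSE NULL END) OVER (PARTITION BY yyyy_mm_dd, h_id) as primary_prov,
    collect_set(api) OVER (PARTITION BY yyyy_mm_dd, h_id, prov_id) prov_id_api,
    prov_id
FROM(
    SELECT
        yyyy_mm_dd,
        h_id,
        prov_id,
        api,
        ROW_NUMBER() OVER(PARTITION BY yyyy_mm_dd, h_id ORDER BY api) rn
    FROM(
        SELECT DISTINCT --DISTINCT replaces the collect_set + GROUP BY de-duplication
            t.yyyy_mm_dd,
            t.h_id,
            prov_id,
            CASE --api is now a scalar per row, so the outer collect_set receives scalars
                WHEN e.apis_xml_element = 'res' THEN 1
                WHEN e.apis_xml_element = 'av'  THEN 2
                --... remaining branches as in the original query
                ELSE e.apis_xml_element
            END as api
        FROM
            mytable t
            LATERAL VIEW EXPLODE(apis_xml) e AS apis_xml_element
        WHERE
            yyyy_mm_dd = "2020-05-05"
            AND t.apis_xml IS NOT NULL
        )s
    )s

With the inner collect_set gone, api is a plain column, so both the ROW_NUMBER ordering and the windowed collect_set in the outer query operate on scalar values.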

Upvotes: 1
