Reputation: 781
I have a query which fails when reducing, the error which is thrown is:
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
However, when going deeper into the YARN logs, I was able to find this:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"2020-05-05","reducesinkkey1":10039,"reducesinkkey2":103,"reducesinkkey3":"2020-05-05","reducesinkkey4":10039,"reducesinkkey5":103},"value":{"_col0":103,"_col1":["1","2"]}}
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:265)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"2020-05-05","reducesinkkey1":10039,"reducesinkkey2":103,"reducesinkkey3":"2020-05-05","reducesinkkey4":10039,"reducesinkkey5":103},"value":{"_col0":103,"_col1":["1","2"]}}
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:253)
    ... 7 more
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hadoop.io.Text
The most relevant part being:
java.util.ArrayList cannot be cast to org.apache.hadoop.io.Text
This is the query which I'm executing (FYI: this is a subquery within a bigger query):
SELECT
yyyy_mm_dd,
h_id,
MAX(CASE WHEN rn=1 THEN prov_id ELSE NULL END) OVER (partition by yyyy_mm_dd, h_id) as primary_prov,
collect_set(api) OVER (partition by yyyy_mm_dd, h_id, p_id) prov_id_api, --re-assemble array to include all elements from multiple initial arrays if there are different arrays per prov_id
prov_id
FROM(
SELECT --get "primary prov" (first element in ascending array)
yyyy_mm_dd,
h_id,
prov_id,
api,
ROW_NUMBER() OVER(PARTITION BY yyyy_mm_dd, h_id ORDER BY api) rn
FROM(
SELECT --explode array to get data at row level
t.yyyy_mm_dd,
t.h_id,
prov_id,
collect_set(--array of integers, use set to remove duplicates
CASE
WHEN e.apis_xml_element = 'res' THEN 1
WHEN e.apis_xml_element = 'av' THEN 2
...
...
ELSE e.apis_xml_element
END) as api
FROM
mytable t
LATERAL VIEW EXPLODE(apis_xml) e AS apis_xml_element
WHERE
yyyy_mm_dd = "2020-05-05"
AND t.apis_xml IS NOT NULL
GROUP BY
1,2,3
)s
)s
I have further narrowed the issue down to the top-level select, as the inner select works fine by itself, which makes me believe the problem is happening here specifically:
collect_set(api) OVER (partition by yyyy_mm_dd, h_id, prov_id) prov_id_api
However, I'm unsure how to solve it. In the innermost select, apis_xml
is an array<string>
which holds strings such as 'res' and 'av' up to a point; after that, integers are used, hence the CASE statement to align these. Strangely, if I run this via Spark, i.e. spark.sql(above_query)
, it works. However, on Beeline via HQL, the job gets killed.
Upvotes: 0
Views: 1221
Reputation: 38290
Remove the collect_set
in the inner query: it already produces an array, while the upper collect_set
should receive scalars. Also remove the group by
in the inner query, because without collect_set there is no aggregation any more. You can use DISTINCT if you need to remove duplicates.
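A minimal sketch of the subquery rewritten along these lines (keeping the question's column names, and leaving the elided CASE branches as comments, since they were not shown):

```sql
SELECT
    yyyy_mm_dd,
    h_id,
    MAX(CASE WHEN rn = 1 THEN prov_id END) OVER (PARTITION BY yyyy_mm_dd, h_id) AS primary_prov,
    collect_set(api) OVER (PARTITION BY yyyy_mm_dd, h_id, prov_id) AS prov_id_api,
    prov_id
FROM (
    SELECT
        yyyy_mm_dd,
        h_id,
        prov_id,
        api,
        ROW_NUMBER() OVER (PARTITION BY yyyy_mm_dd, h_id ORDER BY api) AS rn
    FROM (
        SELECT DISTINCT   -- DISTINCT replaces the inner collect_set + GROUP BY
            t.yyyy_mm_dd,
            t.h_id,
            prov_id,
            CASE
                WHEN e.apis_xml_element = 'res' THEN 1
                WHEN e.apis_xml_element = 'av'  THEN 2
                -- ... remaining branches elided as in the question
                ELSE e.apis_xml_element
            END AS api    -- api is now a scalar per row, so the outer collect_set
                          -- no longer receives an array (the cause of the ClassCastException)
        FROM mytable t
        LATERAL VIEW EXPLODE(apis_xml) e AS apis_xml_element
        WHERE yyyy_mm_dd = '2020-05-05'
          AND t.apis_xml IS NOT NULL
    ) s
) s
```

With the inner aggregation gone, each row carries one scalar api value, and the windowed collect_set assembles the per-prov_id array itself, which is what the original inner collect_set + outer collect_set combination was (incorrectly) attempting in two steps.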
Upvotes: 1