Reputation: 23
We are trying to write Hive UDFs in Python to clean our data. The UDF that uses Pandas throws the error below, while another Python script without Pandas works fine. We have already tried several Pandas variations with no luck; since the non-Pandas script works, we are confused about why this one fails. Kindly help us understand the problem. The Pandas code:
import sys

import numpy as np
import pandas as pd

for line in sys.stdin:
    df = line.split('\t')
    df1 = pd.DataFrame(df)
    df2 = df1.T
    df2[0] = np.where(df2[0].str.isalpha(), df2[0], np.nan)
    df2[1] = np.where(df2[1].astype(str).str.isdigit(), df2[1], np.nan)
    df2[2] = np.where(df2[2].astype(str).str.len() != 10, np.nan,
                      df2[2].astype(str))
    # df2[3] = np.where(df2[3].astype(str).str.isdigit(), df2[3], np.nan)
    df2 = df2.dropna()
    print(df2)
I get this error:
FAILED: Execution Error, return code 20003 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. An error occurred when trying to close the Operator running your custom script.
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Upvotes: 0
Views: 224
Reputation: 1126
I think you'll need to look at the detailed job logs for more information. My first guess is that Pandas is not installed on a data node.
If you intend to bundle dependencies with your job, this answer looks appropriate for you: https://stackoverflow.com/a/2869974/7379644
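Alternatively, since you say plain Python scripts work fine on the cluster, you could drop the Pandas dependency entirely. Below is a minimal sketch of the same row-cleaning logic in pure Python; it assumes the first three tab-separated fields are the ones your column checks target (alphabetic, numeric, and exactly 10 characters), and simply drops any row that fails a check, mimicking your dropna():

```python
import sys


def clean(fields):
    """Apply the same checks as the pandas version, in pure Python.

    Returns the cleaned row, or None if any check fails (the
    equivalent of the row being dropped by dropna()).
    """
    if len(fields) < 3:
        return None
    name, number, code = fields[0], fields[1], fields[2]
    if not name.isalpha():          # df2[0]: must be alphabetic
        return None
    if not number.isdigit():        # df2[1]: must be numeric
        return None
    if len(code) != 10:             # df2[2]: must be exactly 10 chars
        return None
    return [name, number, code]


if __name__ == "__main__":
    for line in sys.stdin:
        cleaned = clean(line.rstrip("\n").split("\t"))
        if cleaned is not None:
            # Hive TRANSFORM expects tab-separated columns on stdout
            print("\t".join(cleaned))
```

Note that emitting tab-separated fields (rather than printing a DataFrame's repr) is also what Hive's TRANSFORM expects back from a streaming script.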
Upvotes: 0