DJElbow

Reputation: 3463

Create multiple columns from single Hive UDF

I am using Amazon EMR and Hive 0.11. I am trying to create a Hive UDF that will return multiple columns from one UDF call.

For example, I would like to call a UDF like the one below and be returned several (named) columns.

SELECT get_data(columnname) FROM table;

I am having trouble finding documentation of this being done, but have heard it is possible if using a Generic UDF. Does anyone know what needs to be returned from the evaluate() method for this to work?

Upvotes: 4

Views: 4101

Answers (1)

dknight

Reputation: 679

I just use GenericUDTF. After you write a UDF that extends GenericUDTF, your UDTF should implement two important methods: initialize and process. (Note that a GenericUDTF has no evaluate() method; process plays that role.)

  • In initialize, you can check the argument types and set the return object type. For example, with ObjectInspectorFactory.getStandardStructObjectInspector you declare the output columns: the names come from the structFieldNames argument and the column value types from structFieldObjectInspectors. The number of output columns is the size of the structFieldNames list. Note that there are two type systems, java and hadoop: the java ObjectInspectors begin with javaXXObjectInspector, while the hadoop ones begin with writableXXObjectInspector (see the snippet just after this list).
  • In process, the logic is similar to a common UDF, except that you should use the ObjectInspectors saved in initialize() to convert each argument Object to a concrete value such as String, Integer, etc. Then call the forward function to output a row; in the row array (forwardColObj in the example below), you set one object per output column.
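For instance, the two flavors of a string ObjectInspector look like this (a minimal illustration; the class name OiFlavors is made up, but both factory fields are part of the real Hive API):

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class OiFlavors {
    public static void main(String[] args) {
        // java type system: column values are plain java.lang.String
        ObjectInspector javaOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
        // hadoop type system: column values are org.apache.hadoop.io.Text
        ObjectInspector writableOI = PrimitiveObjectInspectorFactory.writableStringObjectInspector;
        // both report the same Hive type name; only the runtime representation differs
        System.out.println(javaOI.getTypeName());     // string
        System.out.println(writableOI.getTypeName()); // string
    }
}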

The following is a simple example:


import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class UDFExtractDomainMethod extends GenericUDTF {

    // number of output columns
    private static final Integer OUT_COLS = 2;
    // reusable row buffer, one slot per output column
    private transient Object[] forwardColObj = new Object[OUT_COLS];

    private transient ObjectInspector[] inputOIs;

    /**
     * @param argOIs the ObjectInspectors of the arguments, used to check that they are valid.
     * @return the output column structure.
     * @throws UDFArgumentException if the argument is not a single string column.
     */
    @Override
    public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
        if (argOIs.length != 1 || argOIs[0].getCategory() != ObjectInspector.Category.PRIMITIVE
                || !argOIs[0].getTypeName().equals(serdeConstants.STRING_TYPE_NAME)) {
            throw new UDFArgumentException("split_url only take one argument with type of string");
        }

        inputOIs = argOIs;
        List<String> outFieldNames = new ArrayList<String>();
        List<ObjectInspector> outFieldOIs = new ArrayList<ObjectInspector>();
        outFieldNames.add("host");
        outFieldNames.add("method");
        // javaStringObjectInspector produces java.lang.String values;
        // writableStringObjectInspector would correspond to hadoop.io.Text
        outFieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        outFieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return  ObjectInspectorFactory.getStandardStructObjectInspector(outFieldNames, outFieldOIs);
    }

    @Override
    public void process(Object[] objects) throws HiveException {
        try {
            // use the ObjectInspector saved in initialize() to convert the argument to a Java String
            String inUrl = ((StringObjectInspector)inputOIs[0]).getPrimitiveJavaObject(objects[0]);
            URI uri = new URI(inUrl);
            forwardColObj[0] = uri.getHost();
            forwardColObj[1] = uri.getRawPath();
            // emit one row with two columns
            forward(forwardColObj);
        } catch (URISyntaxException e) {
            // surface malformed URLs instead of silently printing the stack trace;
            // alternatively, just skip the row here
            throw new HiveException("Could not parse URL", e);
        }
    }

    @Override
    public void close() throws HiveException {
        // nothing to clean up
    }
}
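Once the class is compiled into a jar, you can register and call the UDTF like this (the jar path, the split_url function name, and the logs table with its url column are placeholders, not something from the question):

ADD JAR /path/to/your-udtf.jar;
CREATE TEMPORARY FUNCTION split_url AS 'UDFExtractDomainMethod';

-- standalone: a UDTF must be the only expression in the SELECT list
SELECT split_url(url) AS (host, method) FROM logs;

-- or with LATERAL VIEW, to combine the generated columns with other columns
SELECT l.url, t.host, t.method
FROM logs l
LATERAL VIEW split_url(l.url) t AS host, method;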

Upvotes: 3
