Reputation: 897
I have the following data in a dataframe
col1  col2   col3  col4
1     desc1  v1    v3
2     desc2  v4    v2
1     desc1  v4    v2
2     desc2  v1    v3
I need only the first row for each unique combination of col1 and col2, as shown below.
Expected Output:
col1  col2   col3  col4
1     desc1  v1    v3
2     desc2  v4    v2
How can I achieve this in PySpark (version 1.3.1)?
I managed to get this result by converting the DataFrame into an RDD, applying map and reduceByKey, and then converting the resulting RDD back into a DataFrame. Is there another way to perform this operation using DataFrame functions?
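Roughly, my workaround looked like the sketch below (a simplified version; df and sqlContext stand in for my actual variables):

from pyspark.sql import Row

# Key each row by (col1, col2), keep whichever value pair the reduce sees first,
# then rebuild a DataFrame from the surviving rows.
pairs = df.rdd.map(lambda r: ((r.col1, r.col2), (r.col3, r.col4)))
kept = pairs.reduceByKey(lambda a, b: a)
rows = kept.map(lambda kv: Row(col1=kv[0][0], col2=kv[0][1],
                               col3=kv[1][0], col4=kv[1][1]))
df_first = sqlContext.createDataFrame(rows)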
Upvotes: 1
Views: 1512
Reputation: 330173
If you want an arbitrary row you can try to use first or last, but it is far from pretty and I would seriously consider upgrading Spark:
from pyspark.sql.functions import col, first

df = sc.parallelize([
    (1, "desc1", "v1", "v3"), (2, "desc2", "v4", "v2"),
    (1, "desc1", "v4", "v2"), (2, "desc2", "v1", "v3")
]).toDF(["col1", "col2", "col3", "col4"])

keys = ["col1", "col2"]
values = ["col3", "col4"]

# Pack the value columns into a single struct so first() picks them together,
# then unpack the struct fields (col1, col2, ...) back into the original names.
agg_exprs = [first(c).alias(c) for c in keys + ["vs_"]]
select_exprs = keys + [
    "vs_.col{0} AS {1}".format(i + 1, v) for (i, v) in enumerate(values)]

df_not_so_first = (df
    .selectExpr("struct({}) AS vs_".format(",".join(values)), *keys)
    .groupBy(*keys)
    .agg(*agg_exprs)
    .selectExpr(*select_exprs))
Note that in this particular context first doesn't choose any specific row and the results may not be deterministic. Moreover, depending on the Spark version, individual aggregations can be scheduled separately. It means that

df.groupBy("col1", "col2").agg(first("col3"), first("col4"))

doesn't guarantee col3 and col4 will be selected from the same row.
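For what it's worth, if you do upgrade, later Spark versions (1.4+, if I remember correctly) provide dropDuplicates with a column subset, which keeps one (still arbitrary) row per key without the struct juggling:

# Keeps an arbitrary row for each (col1, col2) combination; requires Spark 1.4+.
df_deduped = df.dropDuplicates(["col1", "col2"])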
Upvotes: 2