How to drop duplicates in a python datatable h2oai

Question

The datatable package in Python (https://github.com/h2oai/datatable/) can count the number of unique values in a column. Is there a way to drop duplicate values with this package, or do I have to use the slow pandas package?

Pasha · Accepted Answer

If you want to find the unique values in a single column, you can use the function dt.unique(), which takes a column and returns a new column containing each distinct value from the original once:

>>> import datatable as dt
>>> DT = dt.Frame(A=[1, 3, 2, 1, 4, 2, 1], B=list("ABCDEFG"))
>>> dt.unique(DT["A"])
   |  A
-- + --
 0 |  1
 1 |  2
 2 |  3
 3 |  4

[4 rows x 1 column]

If, on the other hand, you have a multi-column Frame and want to keep only one row for each unique value in one of the columns, then this is equivalent to grouping by that column and taking the first row of each group:

>>> from datatable import f, by, first
>>> DT[:, first(f[1:]), by(f[0])]
   |  A  B 
-- + --  --
 0 |  1  A 
 1 |  2  C 
 2 |  3  B 
 3 |  4  E 

[4 rows x 2 columns]
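
To make the "group by the column, keep the first row" idea concrete, here is a plain-Python sketch of the same logic on the same data. It does not use datatable at all; `first_row_per_key` is a hypothetical helper written only for illustration, and the sort by key is an assumption made to mirror the ordering seen in the output above.

```python
def first_row_per_key(rows, key_index=0):
    """Keep the first row seen for each distinct key, ordered by key.

    Hypothetical helper, not part of datatable: it only imitates what
    the first(...) / by(...) expression above produces.
    """
    seen = {}
    for row in rows:
        key = row[key_index]
        # Only the first occurrence of each key is stored; later
        # duplicates are skipped.
        if key not in seen:
            seen[key] = row
    # Sort by key to match the ordering of the grouped output above.
    return [seen[key] for key in sorted(seen)]


# Same data as the DT frame: column A, then column B.
rows = list(zip([1, 3, 2, 1, 4, 2, 1], "ABCDEFG"))
deduped = first_row_per_key(rows)
print(deduped)
```

The result contains one row per distinct value of A, each paired with the B value from its first occurrence, which is the same set of rows as in the grouped Frame.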
