Deedle - Distinct by column

Question

I had a situation the other day where a particular column of my Frame had some duplicate values.

I wanted to remove any rows where said column had a duplicate value.

I managed to hack a solution using a filter function, and while it was good enough for the exploratory data analysis at hand, it was way more painful that it should have been.

Despite searching high and low, I could not find any ideas on an elegant solution.

I also notices that Series don't offer a DistincyBy() or similar either.

How to do you do a "DistinctBy" operation for a specific column/s ?

Gebb · Accepted Answer

One way to do it is using nest and unnest, something like this:

let noDuplicates: Frame<(int*string), string> =
  df1
  |> Frame.groupRowsBy "Tomas"
  |> Frame.nest
  |> Series.mapValues (Frame.take 1)
  |> Frame.unnest

Let's explain each step. Imagine you have this dataframe:

// Create from individual observations (row * column * value)
let df1 = 
  [ ("Monday", "Tomas", 1); ("Tuesday", "Adam", 2)
    ("Tuesday", "Tomas", 4); ("Wednesday", "Tomas", -5)
    ("Thursday", "Tomas", 4); ("Thursday", "Adam", 5) ]
  |> Frame.ofValues

            Tomas Adam      
Monday    -> 1      
Tuesday   -> 4     2         
Wednesday -> -5     
Thursday  -> 4     5

And you want to remove rows containing duplicate values in the "Tomas" column.

First, group by this column.

let df2 : Frame<(int * string), string> = df1 |> Frame.groupRowsBy "Tomas"

                Tomas Adam      
1  Monday    -> 1      
4  Tuesday   -> 4     2         
4  Thursday  -> 4     5         
-5 Wednesday -> -5

Now you have a frame with a two-level index, which you can turn into a series of data frames.

let df3 = df2 |> Frame.nest

          Tomas Adam      
Monday -> 1      

            Tomas Adam 
Tuesday  -> 4     2    
Thursday -> 4     5    

             Tomas Adam      
Wednesday -> -5

Take the first row of each frame.

let df4 = df3 |> Series.mapValues (fun fr -> fr |> Frame.take 1)

          Tomas Adam      
Monday -> 1      

           Tomas Adam 
Tuesday -> 4     2    

             Tomas Adam      
Wednesday -> -5

It remains to perform the backwards conversion: from a series of data frames into a frame with a two-level index.

let df5 = df4 |> Frame.unnest

                Tomas Adam      
-5 Wednesday -> -5     
1  Monday    -> 1      
4  Tuesday   -> 4     2

Deedle - Distinct by column

Answers (2)

Related Questions