OrdinaryOrange
OrdinaryOrange

Reputation: 2712

Deedle - Distinct by column

I had a situation the other day where a particular column of my Frame had some duplicate values.

I wanted to remove any rows where said column had a duplicate value.

I managed to hack a solution using a filter function, and while it was good enough for the exploratory data analysis at hand, it was way more painful that it should have been.

Despite searching high and low, I could not find any ideas on an elegant solution.

I also notices that Series don't offer a DistincyBy() or similar either.

How to do you do a "DistinctBy" operation for a specific column/s ?

Upvotes: 2

Views: 183

Answers (2)

Anibal Yeh
Anibal Yeh

Reputation: 351

#r "nuget: Deedle, 3.0.0"
#load "Deedle.fsx"
open Deedle
let inline distinctFrame (frame: Frame<'R, 'C>) (keys: 'C seq) (distColId:'C) =

    let idxSource = 
        frame
        |> Frame.mapRows (fun (i:'R) r ->
            i, keys |> Seq.map r.TryGet |> Seq.toArray
        )
        |> fun s -> s.Values

    let idx = 
        idxSource
        |> Seq.groupBy (fun (_, g) -> g)
        |> Seq.map (fun g -> 
            let (idx, _) = snd g |> Seq.item 0
            idx
            )
        |> Seq.distinct
        |> fun s -> Frame([distColId], [Series(s, s)])//"____distinctIdx____"

    let fmj = frame.Join(idx, kind=JoinKind.Inner)
    fmj.Columns[fmj.ColumnKeys |> Seq.filter (fun v -> v <> distColId)]



open System.IO


let data = "A;B\na;1\nb;2\nb;2\nc;3"

let bytes = System.Text.Encoding.UTF8.GetBytes data
let stream =  new MemoryStream( bytes )

let df= 
    Frame.ReadCsv(
        stream = stream,
        separators = ";",
        hasHeaders = true
    )


distinctFrame df ["A";"B"] "____distinctIdx____"

Result:

val data: string = "A;B
a;1
b;2
b;2
c;3"
val bytes: byte array =
  [|65uy; 59uy; 66uy; 10uy; 97uy; 59uy; 49uy; 10uy; 98uy; 59uy; 50uy; 10uy;
    98uy; 59uy; 50uy; 10uy; 99uy; 59uy; 51uy|]
val stream: IO.MemoryStream
val df: Frame<int,string> =
  
     A B 
0 -> a 1 
1 -> b 2 
2 -> b 2 
3 -> c 3 

4 rows x 2 columns
0 missing values
val it: Frame<int,string> =
  
     A B 
0 -> a 1 
1 -> b 2 
3 -> c 3 

3 rows x 2 columns
0 missing values

Upvotes: 0

Gebb
Gebb

Reputation: 6556

One way to do it is using nest and unnest, something like this:

let noDuplicates: Frame<(int*string), string> =
  df1
  |> Frame.groupRowsBy "Tomas"
  |> Frame.nest
  |> Series.mapValues (Frame.take 1)
  |> Frame.unnest

Let's explain each step. Imagine you have this dataframe:

// Create from individual observations (row * column * value)
let df1 = 
  [ ("Monday", "Tomas", 1); ("Tuesday", "Adam", 2)
    ("Tuesday", "Tomas", 4); ("Wednesday", "Tomas", -5)
    ("Thursday", "Tomas", 4); ("Thursday", "Adam", 5) ]
  |> Frame.ofValues
            Tomas Adam      
Monday    -> 1     <missing> 
Tuesday   -> 4     2         
Wednesday -> -5    <missing> 
Thursday  -> 4     5     

And you want to remove rows containing duplicate values in the "Tomas" column.

First, group by this column.

let df2 : Frame<(int * string), string> = df1 |> Frame.groupRowsBy "Tomas"
                Tomas Adam      
1  Monday    -> 1     <missing> 
4  Tuesday   -> 4     2         
4  Thursday  -> 4     5         
-5 Wednesday -> -5    <missing> 

Now you have a frame with a two-level index, which you can turn into a series of data frames.

let df3 = df2 |> Frame.nest
          Tomas Adam      
Monday -> 1     <missing> 

            Tomas Adam 
Tuesday  -> 4     2    
Thursday -> 4     5    

             Tomas Adam      
Wednesday -> -5    <missing> 

Take the first row of each frame.

let df4 = df3 |> Series.mapValues (fun fr -> fr |> Frame.take 1)
          Tomas Adam      
Monday -> 1     <missing> 

           Tomas Adam 
Tuesday -> 4     2    

             Tomas Adam      
Wednesday -> -5    <missing> 

It remains to perform the backwards conversion: from a series of data frames into a frame with a two-level index.

let df5 = df4 |> Frame.unnest
                Tomas Adam      
-5 Wednesday -> -5    <missing> 
1  Monday    -> 1     <missing> 
4  Tuesday   -> 4     2         

Upvotes: 2

Related Questions