Reputation: 2712
I had a situation the other day where a particular column of my Frame
had some duplicate values.
I wanted to remove any rows where said column had a duplicate value.
I managed to hack a solution using a filter function, and while it was good enough for the exploratory data analysis at hand, it was way more painful that it should have been.
Despite searching high and low, I could not find any ideas on an elegant solution.
I also notices that Series
don't offer a DistincyBy()
or similar either.
How to do you do a "DistinctBy" operation for a specific column/s ?
Upvotes: 2
Views: 183
Reputation: 351
#r "nuget: Deedle, 3.0.0"
#load "Deedle.fsx"
open Deedle
let inline distinctFrame (frame: Frame<'R, 'C>) (keys: 'C seq) (distColId:'C) =
let idxSource =
frame
|> Frame.mapRows (fun (i:'R) r ->
i, keys |> Seq.map r.TryGet |> Seq.toArray
)
|> fun s -> s.Values
let idx =
idxSource
|> Seq.groupBy (fun (_, g) -> g)
|> Seq.map (fun g ->
let (idx, _) = snd g |> Seq.item 0
idx
)
|> Seq.distinct
|> fun s -> Frame([distColId], [Series(s, s)])//"____distinctIdx____"
let fmj = frame.Join(idx, kind=JoinKind.Inner)
fmj.Columns[fmj.ColumnKeys |> Seq.filter (fun v -> v <> distColId)]
open System.IO
let data = "A;B\na;1\nb;2\nb;2\nc;3"
let bytes = System.Text.Encoding.UTF8.GetBytes data
let stream = new MemoryStream( bytes )
let df=
Frame.ReadCsv(
stream = stream,
separators = ";",
hasHeaders = true
)
distinctFrame df ["A";"B"] "____distinctIdx____"
Result:
val data: string = "A;B
a;1
b;2
b;2
c;3"
val bytes: byte array =
[|65uy; 59uy; 66uy; 10uy; 97uy; 59uy; 49uy; 10uy; 98uy; 59uy; 50uy; 10uy;
98uy; 59uy; 50uy; 10uy; 99uy; 59uy; 51uy|]
val stream: IO.MemoryStream
val df: Frame<int,string> =
A B
0 -> a 1
1 -> b 2
2 -> b 2
3 -> c 3
4 rows x 2 columns
0 missing values
val it: Frame<int,string> =
A B
0 -> a 1
1 -> b 2
3 -> c 3
3 rows x 2 columns
0 missing values
Upvotes: 0
Reputation: 6556
One way to do it is using nest
and unnest
, something like this:
let noDuplicates: Frame<(int*string), string> =
df1
|> Frame.groupRowsBy "Tomas"
|> Frame.nest
|> Series.mapValues (Frame.take 1)
|> Frame.unnest
Let's explain each step. Imagine you have this dataframe:
// Create from individual observations (row * column * value)
let df1 =
[ ("Monday", "Tomas", 1); ("Tuesday", "Adam", 2)
("Tuesday", "Tomas", 4); ("Wednesday", "Tomas", -5)
("Thursday", "Tomas", 4); ("Thursday", "Adam", 5) ]
|> Frame.ofValues
Tomas Adam
Monday -> 1 <missing>
Tuesday -> 4 2
Wednesday -> -5 <missing>
Thursday -> 4 5
And you want to remove rows containing duplicate values in the "Tomas" column.
First, group by this column.
let df2 : Frame<(int * string), string> = df1 |> Frame.groupRowsBy "Tomas"
Tomas Adam
1 Monday -> 1 <missing>
4 Tuesday -> 4 2
4 Thursday -> 4 5
-5 Wednesday -> -5 <missing>
Now you have a frame with a two-level index, which you can turn into a series of data frames.
let df3 = df2 |> Frame.nest
Tomas Adam
Monday -> 1 <missing>
Tomas Adam
Tuesday -> 4 2
Thursday -> 4 5
Tomas Adam
Wednesday -> -5 <missing>
Take the first row of each frame.
let df4 = df3 |> Series.mapValues (fun fr -> fr |> Frame.take 1)
Tomas Adam
Monday -> 1 <missing>
Tomas Adam
Tuesday -> 4 2
Tomas Adam
Wednesday -> -5 <missing>
It remains to perform the backwards conversion: from a series of data frames into a frame with a two-level index.
let df5 = df4 |> Frame.unnest
Tomas Adam
-5 Wednesday -> -5 <missing>
1 Monday -> 1 <missing>
4 Tuesday -> 4 2
Upvotes: 2