Chris Tarn

Reputation: 629

Remove rows that are near duplicates of the previous row in a Deedle data frame

I have a Deedle data frame that looks like this:

val it : Frame<int,string> =
          Date                        size1 size2 
13     -> 2013-12-12T00:00:00.103336Z 133   35    
14     -> 2013-12-12T00:00:00.105184Z 83    35    
15     -> 2013-12-12T00:00:00.107205Z 83    35    
16     -> 2013-12-12T00:00:00.109566Z 83    34    
17     -> 2013-12-12T00:00:00.115260Z 83    34    
18     -> 2013-12-12T00:00:00.133546Z 83    34    
20     -> 2013-12-12T00:00:00.138204Z 82    34    
22     -> 2013-12-12T00:00:00.140125Z 81    34 

I would like to remove rows that have the same values for both size1 and size2 as the previous row. In pseudocode:

if row?size1 = prevRow?size1 && row?size2 = prevRow?size2 then dropRow

So in the example above I would end up with:

val it : Frame<int,string> =
          Date                        size1 size2 
13     -> 2013-12-12T00:00:00.103336Z 133   35    
14     -> 2013-12-12T00:00:00.105184Z 83    35    
16     -> 2013-12-12T00:00:00.109566Z 83    34    
20     -> 2013-12-12T00:00:00.138204Z 82    34    
22     -> 2013-12-12T00:00:00.140125Z 81    34 

I believe I want to use something like

Frame.filterRowValues (fun row -> ...)

But I don't see how to compare one row against the previous row. Is there a simple way to do this? Perhaps I need to shift and join?

Upvotes: 1

Views: 620

Answers (1)

Tomas Petricek

Reputation: 243106

There are a number of ways to do this, and I'm not quite sure which is the best one:

  • Using shift and join (as you say) would certainly work. You'd need to rename the columns in one of the frames so that you can join them, but it sounds like quite a good solution to me.

  • You can use frame.Rows |> Series.pairwise to get tuples containing the previous and the current row, then use Series.filter and Series.map (to select the second row from the tuple) and reconstruct the frame using Frame.ofRows. The only issue is that you'll always lose the first row this way (and you'll have to add it back).

  • You can use Frame.filterRows and look up the previous row. The recent release supports Lookup.Smaller, which lets you do that easily.

The code for the third option looks like this (note that the frame rows need to be ordered, i.e. frame.Rows.IsOrdered = true, for this to work):

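frame |> Frame.filterRows (fun k row ->
  // Nearest row whose key is strictly smaller than k (Lookup.Smaller is new in v1.0)
  let prev = frame.Rows |> Series.tryLookup k Lookup.Smaller
  match prev with
  // Keep the row only if size1 or size2 changed (column names taken from the question)
  | Some prev -> prev?size1 <> row?size1 || prev?size2 <> row?size2
  // The first row has no predecessor, so always keep it
  | _ -> true)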
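
For comparison, the second option (Series.pairwise) could be sketched roughly as follows. This is only a sketch under a few assumptions: the frame has the type from the question (Frame<int,string>), the columns are named size1 and size2, and dropRepeats is a hypothetical helper name. Rather than rebuilding the frame with Frame.ofRows and adding the first row back, it collects the keys of the rows that changed and filters the original frame by key, which sidesteps the lost first row.

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical helper: drops rows whose size1 and size2 both equal
// the values in the row directly before them.
let dropRepeats (frame: Frame<int, string>) =
    // Series.pairwise yields (previous, current) tuples keyed by the
    // current row's key; there is no entry for the first row.
    let changed =
        frame.Rows
        |> Series.pairwise
        |> Series.filterValues (fun (prev, curr) ->
            prev?size1 <> curr?size1 || prev?size2 <> curr?size2)
        |> Series.keys
        |> Set.ofSeq
    // The first row has no predecessor, so keep it explicitly
    // (None if the frame is empty).
    let firstKey = Seq.tryHead frame.RowKeys
    frame |> Frame.filterRows (fun k _ -> Some k = firstKey || changed.Contains k)
```

Unlike the Lookup.Smaller version, this compares each row with its immediate predecessor in row order, so it does not depend on key lookup over an ordered index.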

Upvotes: 3
