user1532587

Reputation: 1023

Using pandas, how to filter rows with similar values in two columns

I have a big dataframe (~10 million rows). Each row has an ID, a category, and start and end positions.

If two rows are in the same category and their start and end positions overlap within a ±5 tolerance, I want to keep just one of the rows. For example:

1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25

I want to filter out either row 1 or row 2.

What I'm doing right now isn't very efficient:

import pandas as pd

# Column names are assumed from the example rows above.
df = pd.read_csv('data.csv', sep='\t', header=None,
                 names=['id', 'category', 'start', 'end'])
params = {'min_distance': 5}  # the +-5 tolerance

# Pre-split the dataframe by category so each row is only compared
# against rows of the same category.
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

discard = set()
rows = []
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance'])
               & (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.update(res.index.values)
    rows.append(row)
df = pd.DataFrame(rows)

I've also tried a different approach making use of a sorted version of the dataframe.

# Assumes df is sorted by category, then sstart, then send.
my_index = 0
indexes = []  # positions of the rows to keep
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[my_index]
    next_index = 1
    # advance past every consecutive row that overlaps this one
    while my_index + next_index < total_len:
        second_row = df.iloc[my_index + next_index]
        c1 = (row.category == second_row.category)
        c2 = (abs(second_row.sstart - row.sstart) <= params['min_distance'])
        c3 = (abs(second_row.send - row.send) <= params['min_distance'])
        if not (c1 and c2 and c3):
            break
        next_index += 1
    indexes.append(my_index)  # keep one row per run of overlaps
    my_index += next_index
indexes.append(total_len - 1)

The problem is that this solution isn't perfect: it sometimes misses a row, because the overlapping row can be several rows ahead rather than the next one.

I'm looking for any ideas on how to approach this problem in a more pandas-friendly way, if one exists.

Upvotes: 1

Views: 1674

Answers (3)

Christoph Burschka

Reputation: 4689

I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:

dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only that category's rows. The outputs are combined back into a single dataframe, which will have a MultiIndex with the value of "category" as the outer level; this can be removed with droplevel(0).
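
For illustration, the full pattern would look something like this (with drop_duplicates as sketched below):

result = (
    df.groupby('category')      # one sub-dataframe per category
      .apply(drop_duplicates)   # filter each sub-dataframe separately
      .droplevel(0)             # remove the outer 'category' index level
)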

Secondly, within the category you could sort by the first of the two numeric columns for another small speed-up:

def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...

This will allow you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
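
For instance, a minimal sketch of such a drop_duplicates (assuming, as in the question, columns named sstart and send and a tolerance of 5):

TOLERANCE = 5

def drop_duplicates(group):
    group = group.sort_values("sstart")
    kept = []  # index labels of the rows kept so far
    for idx, row in group.iterrows():
        duplicate = False
        # scan the kept rows from nearest to farthest sstart
        for kept_idx in reversed(kept):
            kept_row = group.loc[kept_idx]
            if row.sstart - kept_row.sstart > TOLERANCE:
                break  # sorted by sstart, so earlier rows are even farther
            if abs(row.send - kept_row.send) <= TOLERANCE:
                duplicate = True
                break
        if not duplicate:
            kept.append(idx)
    return group.loc[kept]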

Upvotes: 0

Petronella

Reputation: 2535

Do you want to merge all similar rows, or only two consecutive ones? If all similar, I suggest you first sort the rows by category and then by the two other columns, and squash the similar rows into a single one. If only two consecutive rows, check whether the next value is within the range you set and, if so, merge it. Here you can see how:

merge rows pandas dataframe based on condition
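
For the consecutive case, a rough sketch using shift (the column names category, start and end and the ±5 tolerance are assumed from the question; each row is compared only to its immediate predecessor):

df = df.sort_values(['category', 'start', 'end']).reset_index(drop=True)
prev = df.shift(1)
similar_to_prev = (
    (df['category'] == prev['category'])
    & ((df['start'] - prev['start']).abs() <= 5)
    & ((df['end'] - prev['end']).abs() <= 5)
)
df = df[~similar_to_prev]  # keep the first row of each consecutive run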

Upvotes: 0

Pavel Kovtun

Reputation: 367

The approach here should be this:

  1. pandas.groupby by category
  2. agg(Func) on the groupby result
  3. Func should implement the logic of finding the best range inside each category (sorted search, balanced trees, or anything else); see the sketch below
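
A minimal sketch of these steps, assuming columns named category, start and end and the question's ±5 tolerance (written with apply rather than agg, since the function returns filtered rows rather than a single aggregate):

TOL = 5  # the +-5 tolerance from the question

def best_ranges(group):
    # sort so similar rows sit together, then keep a row only if no
    # previously kept row is within the tolerance on both columns
    group = group.sort_values(['start', 'end'])
    kept = []
    for idx, row in group.iterrows():
        if not any(abs(row['start'] - group.at[k, 'start']) <= TOL
                   and abs(row['end'] - group.at[k, 'end']) <= TOL
                   for k in kept):
            kept.append(idx)
    return group.loc[kept]

result = df.groupby('category', group_keys=False).apply(best_ranges)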

Upvotes: 1
