Cagri

Reputation: 727

How to remove duplicate groups in a pandas dataframe?

I would like to remove duplicated groups of rows from the dataframe.

import pandas as pd
import pdb
filename = "result_4_tiling_116.csv"

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, names=['id', 'tileID', 'x', 'y', 'h', 'w'], chunksize=chunksize):
    pdb.set_trace()  # stop here to inspect each chunk interactively

An example of the first 31 lines of the data:

chunk.head(31)

      tileID  x     y  h    w
0          0  0   0.0  1  8.0
1          1  0   8.0  1  8.0
2          0  0   8.0  1  8.0
3          1  0   0.0  1  4.0
4          2  0   4.0  1  4.0
5          0  0   0.0  1  4.0
6          1  0   4.0  1  4.0
7          2  0   8.0  1  4.0
8          3  0  12.0  1  4.0
9          0  0   4.0  1  4.0
10         1  0   8.0  1  4.0
11         2  0  12.0  1  4.0
12         3  0   0.0  1  2.0
13         4  0   2.0  1  2.0
14         0  0   8.0  1  4.0
15         1  0  12.0  1  4.0
16         2  0   0.0  1  2.0
17         3  0   2.0  1  2.0
18         4  0   4.0  1  2.0
19         5  0   6.0  1  2.0
20         0  0  12.0  1  4.0
21         1  0   0.0  1  2.0
22         2  0   2.0  1  2.0
23         3  0   4.0  1  2.0
24         4  0   6.0  1  2.0
25         0  0   8.0  1  4.0
26         1  0  12.0  1  4.0
27         2  0   0.0  1  2.0
28         3  0   2.0  1  2.0
29         4  0   4.0  1  2.0
30         5  0   6.0  1  2.0

I would like to filter out the duplicated groups. The data consists of a series of groups (each starting with tileID=0), as follows:

1.

0          0  0   0.0  1  8.0
1          1  0   8.0  1  8.0

2.

2          0  0   8.0  1  8.0
3          1  0   0.0  1  4.0
4          2  0   4.0  1  4.0

3.

5          0  0   0.0  1  4.0
6          1  0   4.0  1  4.0
7          2  0   8.0  1  4.0
8          3  0  12.0  1  4.0

4.

9          0  0   4.0  1  4.0
10         1  0   8.0  1  4.0
11         2  0  12.0  1  4.0
12         3  0   0.0  1  2.0
13         4  0   2.0  1  2.0

5.

14         0  0   8.0  1  4.0
15         1  0  12.0  1  4.0
16         2  0   0.0  1  2.0
17         3  0   2.0  1  2.0
18         4  0   4.0  1  2.0
19         5  0   6.0  1  2.0

6.

20         0  0  12.0  1  4.0
21         1  0   0.0  1  2.0
22         2  0   2.0  1  2.0
23         3  0   4.0  1  2.0
24         4  0   6.0  1  2.0

7.

25         0  0   8.0  1  4.0
26         1  0  12.0  1  4.0
27         2  0   0.0  1  2.0
28         3  0   2.0  1  2.0
29         4  0   4.0  1  2.0
30         5  0   6.0  1  2.0

In this example, groups 5 and 7 are duplicates. I tried to use drop_duplicates, but with no success yet.
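
A plain row-wise call (just an illustration; the exact call attempted is not shown in the question) treats each row independently rather than comparing whole groups:

chunk.drop_duplicates(subset=['tileID', 'x', 'y', 'h', 'w'])  # row-wise, not group-wise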

Upvotes: 2

Views: 654

Answers (2)

Georgy

Reputation: 13697

This may not be the most efficient way to solve this problem, but it gives correct results.

Let df be your initial dataframe:

unique_chunks = []
# a new group starts whenever tileID does not increase by 1
for _, chunk in df.groupby((df['tileID'].diff() != 1).cumsum()):
    unindexed_chunk = chunk.reset_index(drop=True)

    # compare against every group collected so far, ignoring the original index
    for unique_chunk in unique_chunks:
        unindexed_unique_chunk = unique_chunk.reset_index(drop=True)
        if unindexed_chunk.equals(unindexed_unique_chunk):
            break  # this group is a duplicate
    else:
        unique_chunks.append(chunk)

output_df = pd.concat(unique_chunks)

will give:

    tileID  x     y  h    w
0        0  0   0.0  1  8.0
1        1  0   8.0  1  8.0
2        0  0   8.0  1  8.0
3        1  0   0.0  1  4.0
4        2  0   4.0  1  4.0
5        0  0   0.0  1  4.0
6        1  0   4.0  1  4.0
7        2  0   8.0  1  4.0
8        3  0  12.0  1  4.0
9        0  0   4.0  1  4.0
10       1  0   8.0  1  4.0
11       2  0  12.0  1  4.0
12       3  0   0.0  1  2.0
13       4  0   2.0  1  2.0
14       0  0   8.0  1  4.0
15       1  0  12.0  1  4.0
16       2  0   0.0  1  2.0
17       3  0   2.0  1  2.0
18       4  0   4.0  1  2.0
19       5  0   6.0  1  2.0
20       0  0  12.0  1  4.0
21       1  0   0.0  1  2.0
22       2  0   2.0  1  2.0
23       3  0   4.0  1  2.0
24       4  0   6.0  1  2.0

The idea here is to iterate over the groups ("chunks") of the initial dataframe, collect the unique ones in a list, and check whether the chunk of the current iteration is already present in that list. Don't forget to reset the indices before comparing!

For explanations on how to iterate over chunks, see this answer.
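
If the list of unique chunks grows long, the pairwise equals check above becomes quadratic. A possible variation (a sketch, not part of the original answer; df and pd as above, columns taken from the question's sample) is to keep a set of hashable group fingerprints instead:

unique_chunks = []
seen = set()  # fingerprints of the groups kept so far

for _, chunk in df.groupby((df['tileID'].diff() != 1).cumsum()):
    # a tuple of row tuples is hashable and ignores the original index;
    # adjust the column list if the 'id' column should also count
    fingerprint = tuple(map(tuple, chunk[['tileID', 'x', 'y', 'h', 'w']].values))
    if fingerprint not in seen:
        seen.add(fingerprint)
        unique_chunks.append(chunk)

output_df = pd.concat(unique_chunks)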


Edit:
For the case of a very large input file of ~20 GB, you can try saving the processed unique chunks to a file instead of keeping them in a list, and reading them back in chunks, in the same manner as you do with your input file.
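
A minimal sketch of that idea, combined with the fingerprint set from the previous snippet (the input file name and column names are taken from the question; unique_groups.csv is a hypothetical output path). The membership test is still kept in memory here for brevity; if even that grows too large, it could be replaced by re-reading unique_groups.csv in chunks, as suggested above:

import pandas as pd

filename = 'result_4_tiling_116.csv'   # input file from the question
out_filename = 'unique_groups.csv'     # hypothetical output file
names = ['id', 'tileID', 'x', 'y', 'h', 'w']

seen = set()            # fingerprints of the groups written so far
header_written = False

for chunk in pd.read_csv(filename, names=names, chunksize=10 ** 6):
    # note: a group that straddles a chunk boundary will be split here;
    # handling that case is left out of this sketch
    for _, group in chunk.groupby((chunk['tileID'].diff() != 1).cumsum()):
        fingerprint = tuple(map(tuple, group[['tileID', 'x', 'y', 'h', 'w']].values))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # append the unique group to the output file instead of keeping it in RAM
        group.to_csv(out_filename,
                     mode='w' if not header_written else 'a',
                     header=not header_written,
                     index=False)
        header_written = True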

Upvotes: 1

Zaynul Abadin Tuhin

Reputation: 32003

But look, drop_duplicates does work:

import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
# keep=False drops every row whose (A, C) combination is duplicated
deduped = df.drop_duplicates(subset=['A', 'C'], keep=False)
print(deduped)
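
drop_duplicates compares individual rows, so applying it to whole groups as in the question needs one extra step. A possible sketch (not part of this answer; df is the question's dataframe, and a new group is assumed to start whenever tileID stops increasing by 1) collapses each group into a hashable key and deduplicates the keys:

# label the groups: a new one starts whenever tileID does not increase by 1
group_id = (df['tileID'].diff() != 1).cumsum()

# one hashable fingerprint per group, built from the columns shown in the question
keys = df.groupby(group_id).apply(
    lambda g: tuple(map(tuple, g[['tileID', 'x', 'y', 'h', 'w']].values))
)

# keep only the rows that belong to the first occurrence of each fingerprint
unique_ids = keys.drop_duplicates().index
output_df = df[group_id.isin(unique_ids)]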

Upvotes: 1
