Reputation: 727
I would like to remove duplicated sets of rows from the dataframe.
import pandas as pd
import pdb

filename = "result_4_tiling_116.csv"
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, names=['id', 'tileID', 'x', 'y', 'h', 'w'], chunksize=chunksize):
    pdb.set_trace()
Here is an example of the first 31 lines of the data:
chunk.head(31)
tileID x y h w
0 0 0 0.0 1 8.0
1 1 0 8.0 1 8.0
2 0 0 8.0 1 8.0
3 1 0 0.0 1 4.0
4 2 0 4.0 1 4.0
5 0 0 0.0 1 4.0
6 1 0 4.0 1 4.0
7 2 0 8.0 1 4.0
8 3 0 12.0 1 4.0
9 0 0 4.0 1 4.0
10 1 0 8.0 1 4.0
11 2 0 12.0 1 4.0
12 3 0 0.0 1 2.0
13 4 0 2.0 1 2.0
14 0 0 8.0 1 4.0
15 1 0 12.0 1 4.0
16 2 0 0.0 1 2.0
17 3 0 2.0 1 2.0
18 4 0 4.0 1 2.0
19 5 0 6.0 1 2.0
20 0 0 12.0 1 4.0
21 1 0 0.0 1 2.0
22 2 0 2.0 1 2.0
23 3 0 4.0 1 2.0
24 4 0 6.0 1 2.0
25 0 0 8.0 1 4.0
26 1 0 12.0 1 4.0
27 2 0 0.0 1 2.0
28 3 0 2.0 1 2.0
29 4 0 4.0 1 2.0
30 5 0 6.0 1 2.0
I would like to filter out the duplicated groups. The data consists of a sequence of groups, each starting with tileID=0, as follows:
1.
0 0 0 0.0 1 8.0
1 1 0 8.0 1 8.0
2.
2 0 0 8.0 1 8.0
3 1 0 0.0 1 4.0
4 2 0 4.0 1 4.0
3.
5 0 0 0.0 1 4.0
6 1 0 4.0 1 4.0
7 2 0 8.0 1 4.0
8 3 0 12.0 1 4.0
4.
9 0 0 4.0 1 4.0
10 1 0 8.0 1 4.0
11 2 0 12.0 1 4.0
12 3 0 0.0 1 2.0
13 4 0 2.0 1 2.0
5.
14 0 0 8.0 1 4.0
15 1 0 12.0 1 4.0
16 2 0 0.0 1 2.0
17 3 0 2.0 1 2.0
18 4 0 4.0 1 2.0
19 5 0 6.0 1 2.0
6.
20 0 0 12.0 1 4.0
21 1 0 0.0 1 2.0
22 2 0 2.0 1 2.0
23 3 0 4.0 1 2.0
24 4 0 6.0 1 2.0
7.
25 0 0 8.0 1 4.0
26 1 0 12.0 1 4.0
27 2 0 0.0 1 2.0
28 3 0 2.0 1 2.0
29 4 0 4.0 1 2.0
30 5 0 6.0 1 2.0
In this example, groups 5 and 7 are duplicates. I tried to use drop_duplicates, but with no success so far.
Upvotes: 2
Views: 654
Reputation: 13697
This may not be the most efficient way to solve this problem, but it gives correct results.
Let df be your initial dataframe:
unique_chunks = []
# a new group starts wherever tileID does not increase by 1
for _, chunk in df.groupby((df['tileID'].diff() != 1).cumsum()):
    unindexed_chunk = chunk.reset_index(drop=True)
    for unique_chunk in unique_chunks:
        unindexed_unique_chunk = unique_chunk.reset_index(drop=True)
        if unindexed_chunk.equals(unindexed_unique_chunk):
            break
    else:
        # the loop did not break, so no identical group was found; keep this one
        unique_chunks.append(chunk)

output_df = pd.concat(unique_chunks)
This will give the following output_df:
tileID x y h w
0 0 0 0.0 1 8.0
1 1 0 8.0 1 8.0
2 0 0 8.0 1 8.0
3 1 0 0.0 1 4.0
4 2 0 4.0 1 4.0
5 0 0 0.0 1 4.0
6 1 0 4.0 1 4.0
7 2 0 8.0 1 4.0
8 3 0 12.0 1 4.0
9 0 0 4.0 1 4.0
10 1 0 8.0 1 4.0
11 2 0 12.0 1 4.0
12 3 0 0.0 1 2.0
13 4 0 2.0 1 2.0
14 0 0 8.0 1 4.0
15 1 0 12.0 1 4.0
16 2 0 0.0 1 2.0
17 3 0 2.0 1 2.0
18 4 0 4.0 1 2.0
19 5 0 6.0 1 2.0
20 0 0 12.0 1 4.0
21 1 0 0.0 1 2.0
22 2 0 2.0 1 2.0
23 3 0 4.0 1 2.0
24 4 0 6.0 1 2.0
The idea here is to iterate over the groups of the initial dataframe, collect the unique ones in a list, and check whether the group of the current iteration is already present in that list. Don't forget to reset the indices before comparing!
For explanations on how to iterate over chunks, see this answer.
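If the nested comparison against unique_chunks becomes slow for a large number of groups, one possible variant (just a sketch, not part of the original answer) is to collapse each group into a hashable signature and keep a set of signatures already seen; group_id, seen and keep_ids below are illustrative helper names, and the sketch assumes a group starts wherever tileID == 0:
import pandas as pd

# a new group starts wherever tileID == 0
group_id = (df['tileID'] == 0).cumsum()

seen = set()       # signatures of groups already encountered
keep_ids = []      # ids of groups whose first occurrence we keep
for gid, group in df.groupby(group_id):
    # order-sensitive, hashable snapshot of the group's rows
    sig = tuple(map(tuple, group[['tileID', 'x', 'y', 'h', 'w']].values))
    if sig not in seen:
        seen.add(sig)
        keep_ids.append(gid)

output_df = df[group_id.isin(keep_ids)]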
Edit:
For the case of a very large input file (~20 GB), you can try saving the processed unique chunks to a file instead of keeping them in a list, and then reading them back in chunks, in the same manner as you do with your input file.
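A minimal sketch of that idea, assuming a hypothetical output path unique_groups.csv and keeping only the group signatures (rather than the groups themselves) in memory, could look like this; note that a group split across a chunk boundary would need extra handling that is omitted here:
import pandas as pd

filename = "result_4_tiling_116.csv"
out_file = "unique_groups.csv"    # hypothetical output path
chunksize = 10 ** 6

seen = set()          # signatures of groups already written
first_write = True
for chunk in pd.read_csv(filename, names=['id', 'tileID', 'x', 'y', 'h', 'w'],
                         chunksize=chunksize):
    # a new group starts wherever tileID == 0
    group_id = (chunk['tileID'] == 0).cumsum()
    for _, group in chunk.groupby(group_id):
        sig = tuple(map(tuple, group[['tileID', 'x', 'y', 'h', 'w']].values))
        if sig not in seen:
            seen.add(sig)
            # append each new unique group to the output file
            group.to_csv(out_file, mode='w' if first_write else 'a',
                         header=first_write, index=False)
            first_write = False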
Upvotes: 1
Reputation: 32003
But look, drop_duplicates does work:
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
# drop_duplicates returns a new dataframe, so assign the result before printing
df = df.drop_duplicates(subset=['A', 'C'], keep=False)
print(df)
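Keep in mind that drop_duplicates returns a new DataFrame rather than modifying df in place, so the result has to be assigned (as above). Also, keep=False removes every member of a duplicate set, while the default keep='first' retains one copy. Since drop_duplicates compares individual rows, the group-level duplicates in the question would first have to be collapsed into one comparable value per group.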
Upvotes: 1