user3704597

Reputation: 27

Remove the duplicate rows in a CSV file

I have the following Python function that exports JSON data to a CSV file. It works fine: the keys (CSV headers) and values (CSV rows) are populated correctly, but I'm trying to remove the duplicate rows in the CSV file.

Instead of manually removing them in Excel, how do I remove the duplicate rows in Python?

import csv

def toCSV(res):
    with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['process_hash', 'process_name', 'process_effective_reputation']
        dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        dict_writer.writeheader()
        for r in res:
            dict_writer.writerow(r)

Thank you

For example, in the CSV below the rows with the apmsgfwd.exe information are duplicated:

process_hash    process_name    process_effective_reputation
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
['73ca11f2acf1adb7802c2914e1026db899a3c851cd9500378c0045e0']    c:\users\zdr3dds01\documents\sap\sap gui\export.mhtml   NOT_LISTED
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
['582f018bc7a732d63f624d6f92b3d143', '66505bcb9975d61af14dd09cddd9ac0d11a3e2b5ae41845c65117e7e2b046d37']    c:\users\jij09\appdata\local\kingsoft\power word 2016\2016.3.3.0368\powerword.exe   ADAPTIVE_WHITE_LIST

json data:

[
    {'device_name': 'fk6sdc2',
     'device_timestamp': '2020-10-27T00:50:46.176Z',
     'event_id': '9b1bvf6e17ee11eb81b',
     'process_effective_reputation': 'LIST',
     'process_hash': ['bfc7dcf5935830f3a9df8e9b6425c37a',
                      'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
     'process_name': 'c:\\program files (x86)\\toh122soft\\thcasdf3\\toho34rce.exe',
     'process_username': ['JOHN\\user1']},
    {'device_name': 'fk6sdc2',
     'device_timestamp': '2020-10-27T00:50:46.176Z',
     'event_id': '9b151f6e17ee11eb81b',
     'process_effective_reputation': 'LIST',
     'process_hash': ['bfc7dcf5935f3a9df8e9b6830425c37a',
                      'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
     'process_name': 'c:\\program files (x86)\\oft\\tf3\\tootsice.exe',
     'process_username': ['JOHN\\user2']},
    {'device_name': '6asdsdc2',
     'device_timestamp': '2020-10-27T00:50:46.176Z',
     'event_id': '9b151f698e11eb81b',
     'process_effective_reputation': 'LIST',
     'process_hash': ['9df8ebfc7dcf5935830f3a9b6425c37a',
                      'ca9f3a24506cc518ff6ddc939a33c100b2d557f96e040f7124641ad1734e2f19'],
     'process_name': 'c:\\program files (x86)\\toht\\th3\\tohce.exe',
     'process_username': ['JOHN\\user3']}
]

Upvotes: 0

Views: 583

Answers (2)

balderman

Reputation: 23815

Below is a standalone example that shows how to filter out duplicates. The idea is to take the values of each dict and convert them into a tuple; using a set, we can then skip rows we have already written.

import csv

csv_columns = ['No', 'Name', 'Country']
dict_data = [
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 2, 'Name': 'Ben', 'Country': ['USA']},
]
csv_file = "Names.csv"

with open(csv_file, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
    writer.writeheader()
    entries = set()  # rows already written, stored as hashable tuples
    for data in dict_data:
        # Join list values into a string so the whole row becomes a hashable tuple.
        val = tuple(','.join(v) if isinstance(v, list) else v for v in data.values())
        if val not in entries:
            writer.writerow(data)
            entries.add(val)
print('done')

Names.csv

No,Name,Country
1,Alex,['India']
2,Ben,['USA']
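
Applied to the function in the question, the same idea looks roughly like this (a sketch, not tested against the real data; it assumes each r in res is a dict shaped like the JSON shown, and it keys the seen set on the three exported fields only):

import csv

def toCSV(res):
    with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['process_hash', 'process_name', 'process_effective_reputation']
        dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        dict_writer.writeheader()
        seen = set()  # hashable keys of rows already written
        for r in res:
            # Lists (e.g. process_hash) are joined so the key is hashable.
            key = tuple(','.join(v) if isinstance(v, list) else v
                        for v in (r.get(f) for f in fieldnames))
            if key not in seen:
                seen.add(key)
                dict_writer.writerow(r)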

Upvotes: 0

Prakriti Shaurya

Reputation: 197

Is it necessary to use the approach above? If not, I usually use the pandas library for working with CSV files:

import pandas as pd

# Read the CSV, drop exact duplicate rows, and write the result back out.
data = pd.read_csv('EnrichedEvents.csv')
data.drop_duplicates(inplace=True)

data.to_csv('output.csv', index=False)
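
If only certain columns should decide what counts as a duplicate, drop_duplicates also accepts a subset argument. For example (assuming the column names from the question):

import pandas as pd

data = pd.read_csv('EnrichedEvents.csv')
# Rows count as duplicates when hash and name match; keep the first occurrence.
data = data.drop_duplicates(subset=['process_hash', 'process_name'], keep='first')
data.to_csv('output.csv', index=False)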

Upvotes: 1
