ankitpandey

Reputation: 349

List down duplicate files from a list in Python

I am trying to move all files with similar names into a separate folder, but I can't figure out their names in order to move them. Below, I create a folder named Duplicate in my working directory and then pass each file name through the split function to get the middle part of the name, from the line xmlName = xml.split('.')[1]. Now xmlName holds only the part of the file name that helps me decide which names are duplicates.
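
For example, splitting one of the file names on dots gives three parts, and index 1 is the middle one:

>>> 'CRON.SC_CACHE_PURGE_01.xml'.split('.')
['CRON', 'SC_CACHE_PURGE_01', 'xml']
>>> 'CRON.SC_CACHE_PURGE_01.xml'.split('.')[1]
'SC_CACHE_PURGE_01'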

Below is the list of files in the working directory:

# ls
CRON.JC_ADA_SOURCE_DLOAD.xml            Duplicate                                   TERA.SC_CACHE_PURGE_01.xml
CRON.JC_ADA_SOURCE_WLOAD.xml            POWE.BI_RUN_INFO_WKFLW_INF1.xml  test.py
CRON.SC_ADA_CLEANUP_SCRIPT.xml          POWE.JC_ADA_SOURCE_DLOAD.xml            Unknown
CRON.SC_CACHE_PURGE_01.xml              POWE.SC_CHECK_ADA_DATA_FILE_INF2.xml
#

Below is the code (where I am not sure how to list only the duplicate files).

#!/usr/bin/python

import os, sys

Working_Dir = "/home/export/Partition/JobDefinition"

if not os.path.exists('./Duplicate'):
    os.makedirs('./Duplicate', 0755)

for path, dir, files in os.walk(Working_Dir):
    for xml in files:
        xmlName = xml.split('.')[1]
        if xmlName == xmlName:  # always True -- this is the check I can't figure out
            print xmlName

Output:

# python test.py
SC_ADA_CLEANUP_SCRIPT
SC_CHECK_ADA_DATA_FILE_INF2
JC_ADA_SOURCE_WLOAD
BI_RUN_INFO_WKFLW_INF1
JC_ADA_SOURCE_DLOAD
SC_CACHE_PURGE_01
JC_ADA_SOURCE_DLOAD
SC_CACHE_PURGE_01
py
#

The output I need is the names below, so that I can move the respective files to the Duplicate folder:

JC_ADA_SOURCE_DLOAD
SC_CACHE_PURGE_01

Upvotes: 1

Views: 195

Answers (3)

Joe T. Boka

Reputation: 6587

If you are trying to find the duplicate elements in your list and create another list containing only those duplicates, this is how you can do it:

Here I have a list a with two duplicate elements in it, 2 and 3. I find those elements in a and create another list b that contains only those two elements.

import collections

a = [1, 2, 3, 4, 5, 6, 2, 3]
b = [item for item, count in collections.Counter(a).items() if count > 1]

When you print b the output is:

[2, 3]

Then, if you later also want to remove the duplicate elements from a, you can use set like this:

a = set([1, 2, 3, 4, 5, 6, 2, 3])

Now when you print a the output is:

set([1, 2, 3, 4, 5, 6])
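
Applied to the file names in your question, the same Counter pattern might look like this (just a sketch; Working_Dir is the directory from your question):

import collections
import os

Working_Dir = "/home/export/Partition/JobDefinition"

names = []
for path, dirs, files in os.walk(Working_Dir):
    names.extend(xml.split('.')[1] for xml in files)

# Keep only the middle names that occur more than once.
dup_names = [n for n, count in collections.Counter(names).items() if count > 1]
print dup_names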

Upvotes: 0

NightShadeQueen

Reputation: 3334

The Lazy Answer

collections.Counter will do what you want, by magic.

import collections
import os

c = collections.Counter()

# Tally how often each middle name appears across the whole tree
# (Working_Dir as defined in the question).
for path, dirs, files in os.walk(Working_Dir):
    c += collections.Counter([xml.split('.')[1] for xml in files])
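
From there, one way to pull out just the names that occur more than once:

duplicates = [name for name, count in c.items() if count > 1]
print duplicates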

The Somewhat Less Lazy Answer

Keep track of every unique file with set

seen = set()
duplicates = set()
for path, dirs, files in os.walk(Working_Dir):
    for xml in files:
        xmlName = xml.split('.')[1]
        if xmlName in seen:
            # Second (or later) sighting of this middle name.
            duplicates.add(xmlName)
        seen.add(xmlName)
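
Either way, once you know the duplicate middle names, a second pass can move the matching files. A minimal sketch, assuming the Duplicate folder from the question already exists (note that os.walk will also descend into Duplicate itself if it sits inside Working_Dir, so you may want to skip it):

import os
import shutil

for path, dirs, files in os.walk(Working_Dir):
    for xml in files:
        if xml.split('.')[1] in duplicates:
            # Move the file into the Duplicate folder created in the question.
            shutil.move(os.path.join(path, xml), './Duplicate')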

Upvotes: 1

Scott Hunter

Reputation: 49803

If you only want duplicates, you can store names as you find them in something (a set would be most appropriate, but a list will do); if something you are about to put in is already there, it is a duplicate.
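
A minimal sketch of that idea, reusing the walk and split from the question:

import os
import shutil

seen = set()
for path, dirs, files in os.walk(Working_Dir):
    for xml in files:
        name = xml.split('.')[1]
        if name in seen:
            # Already seen this middle name, so this file is a duplicate.
            shutil.move(os.path.join(path, xml), './Duplicate')
        else:
            seen.add(name)

Note that this moves only the later occurrences; if you also want the first file with each duplicated name, you need a second pass (or one of the Counter approaches above).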

Upvotes: 0
