Reputation: 4220
Given a list of deviceIds, I am trying to come up with a more efficient way of handling duplicates. When a duplicate is found in the deviceId list, I need to keep only the latest file and delete the others. What I've come up with so far seems to work OK, but I'm wondering if it can be made more efficient. My current method doesn't seem to scale well; for example, it processes 25,000 files in 5 seconds but takes 70 seconds for 100,000 files. Any thoughts?
List<File> filteredList;
for (int i = 0; i < deviceIds.size(); i++) {
    if (i < (deviceIds.size() - 1) && deviceIds.get(i).equals(deviceIds.get(i + 1))) {
        filteredList = Lists.newArrayList(Iterables.filter(fileList, new DeviceIdFilter(deviceIds.get(i))));
        Collections.sort(filteredList, new OldestFileComparator());
        for (int t = 0; t < (filteredList.size() - 1); t++) {
            filteredList.get(t).delete();
        }
    }
}

private static class DeviceIdFilter implements Predicate<File> {
    private String deviceId;

    private DeviceIdFilter(final String deviceId) {
        this.deviceId = deviceId;
    }

    @Override
    public boolean apply(final File file) {
        return file.getName().contains(deviceId);
    }
}

public class OldestFileComparator implements Comparator<File> {
    public int compare(File filea, File fileb) {
        if (filea.lastModified() > fileb.lastModified()) {
            return +1;
        } else if (filea.lastModified() < fileb.lastModified()) {
            return -1;
        } else {
            return 0;
        }
    }
}
Edit:
I implemented TacticalCoder's solution, which worked wonderfully, processing 100,000 files in 0.60 seconds.
Map<String, List<File>> fileMap = new HashMap<String, List<File>>();
String deviceId;
List<File> deviceFileList;
for (File file : fileList) {
    deviceId = getDeviceId(file.getName());
    if (fileMap.containsKey(deviceId)) {
        fileMap.get(deviceId).add(file);
    } else {
        deviceFileList = new LinkedList<File>();
        deviceFileList.add(file);
        fileMap.put(deviceId, deviceFileList);
    }
}

for (Map.Entry<String, List<File>> mapEntry : fileMap.entrySet()) {
    deviceFileList = mapEntry.getValue();
    if (deviceFileList.size() > 1) {
        Collections.sort(deviceFileList, new OldestFileComparator());
        for (int t = 0; t < (deviceFileList.size() - 1); t++) {
            deviceFileList.get(t).delete();
        }
    }
}
Upvotes: 3
Views: 103
Reputation: 6325
My current method doesn't seem to scale well; for example, it processes 25,000 files in 5 seconds but takes 70 seconds for 100,000 files. Any thoughts?
That's because you have an O(n^2) algorithm (it can potentially degenerate to much worse than O(n^2) if you happen to have mostly duplicates, in which case you'd also be doing an O(n log n) sort on top of your two for loops, but I take it your 100,000 files aren't basically all the same duplicate). Your numbers fit, by the way: 4 times the files took 14 times as long, which is close to the 16x you'd expect from quadratic growth.
If I read the problem correctly, you could just do a first pass where you build a Map<String, List<File>> (where the key would be the (sub)String corresponding to the device ID).
After that first pass, every file that has a duplicate would be in a list with at least two entries, while every file that has no duplicate would be in its own one-element list.
You'd then iterate over your map, and every time you find a List<File> with more than one entry, you sort that list by date and delete all but the latest file.
Would that work?
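Just to sketch the idea (untested, and using a hypothetical getDeviceId() as a stand-in for however you extract the ID from a filename), you could even let Guava do the grouping with Multimaps.index, since you're already using that library:

ListMultimap<String, File> filesByDevice = Multimaps.index(fileList,
        new Function<File, String>() {
            @Override
            public String apply(final File file) {
                return getDeviceId(file.getName());
            }
        });

for (Collection<File> group : filesByDevice.asMap().values()) {
    if (group.size() > 1) {
        // copy first: the lists returned by Multimaps.index are immutable
        List<File> sorted = Lists.newArrayList(group);
        Collections.sort(sorted, new OldestFileComparator());
        // delete everything except the newest file (last after sorting oldest-first)
        for (File oldFile : sorted.subList(0, sorted.size() - 1)) {
            oldFile.delete();
        }
    }
}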
EDIT: you have to be careful with your device IDs. I don't know at all what they look like, but if one ID can be, say, "nop100" and another can be, say, "nop1000", then if you process "nop100" before "nop1000" you may be in trouble with your contains method call (because filenames belonging to "nop1000" would wrongly match the device ID "nop100"). As far as I can tell, this issue exists in the partial code you posted too. There are workarounds, of course, but it's hard to go further without knowing more about the kind of filenames you're processing.
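For example, with made-up filenames (and assuming, purely for illustration, that the ID runs up to the first underscore -- adjust to your real format):

String nameForNop1000 = "nop1000_20120118.log";

// contains() gives a false positive: the nop1000 file "matches" device nop100
boolean wrongMatch = nameForNop1000.contains("nop100");                        // true

// safer: extract the full ID token and compare it exactly
String idToken = nameForNop1000.substring(0, nameForNop1000.indexOf('_'));     // "nop1000"
boolean exactMatch = idToken.equals("nop100");                                 // false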
Upvotes: 2