Filter objects in a Python list based on a string value

Question

Hi, I have data like this:

data = [{'name': 'root/folder1/f1/s1.csv' , 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f2/s2/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}, 
        {'name': 'root/folder2/f_1/f_2/f_3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}, 
        {'name': 'root/folder2/f3/s3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder3/f3/s3/s4/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name' : 'root/folder3/f3/s3/s4/s5/s6/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}
       ]

For each folder, I want to keep only the file with the shortest path on each branch. folder1 contains only one file, so it is kept as is. In folder2, two paths on the same branch carry a file, root/folder2/f_1/f_2/f_3 and root/folder2/f_1/f_2/f_3/f_4/f_5, so I want only the shorter one. The other folder2 paths, such as 'root/folder2/f3/s3/file.csv', are on separate branches and should be kept as they are. Likewise, folder3 should keep only the file with the shortest path, root/folder3/f3/s3/s4/file4.csv.

Expected output

data = [{'name': 'root/folder1/f1/s1.csv'},
        {'name': 'root/folder2/f2/s2/file.csv'}, 
        {'name': 'root/folder2/f_1/f_2/f_3/file.csv'},
        {'name': 'root/folder2/f3/s3/file.csv'},
        {'name': 'root/folder3/f3/s3/s4/file4.csv'}
       ]

What I have tried so far: I am trying to find the paths with the fewest slashes, but I am not sure how to check this for each subfolder. For example, I did this:

import os

data_dict = {}
for item in data:
    # directory part of the path, without the file name
    dirname = os.path.dirname(item['name'])
    if dirname not in data_dict:
        item['count'] = 1
        data_dict[dirname] = item
    else:
        count = data_dict[dirname]['count'] + 1
        # keep the most recently modified file for this directory
        if item['last_modified'] > data_dict[dirname]['last_modified']:
            data_dict[dirname] = item
        data_dict[dirname]['count'] = count

result = list(data_dict.values())

Alexander · Accepted Answer

Something like this would probably work.

import os
import datetime
from collections import Counter

data = [{'name': 'root/folder1/f1/s1.csv' , 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f2/s2/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f_1/f_2/f_3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f3/s3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder3/f3/s3/s4/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name' : 'root/folder3/f3/s3/s4/s5/s6/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}
       ]

results = []

# this next line creates a list of all the paths minus their file name
# and counts them, which shows us how many duplicate paths there are
# so we can filter those based on the timestamp later on
paths = Counter([os.path.dirname(i['name']) for i in data])

for row in data:
    name = row["name"]
    path, filename = os.path.split(name) # split the path from filename

    # if more than one file lives in this directory, keep only one of
    # them, chosen by timestamp, and skip the others; otherwise fall
    # through to the depth check below
    if paths[path] > 1:
        # `lst` holds only the entries that sit directly in this directory
        lst = [i for i in data if os.path.dirname(i['name']) == path]
        # min() keeps the oldest entry; use max() instead to keep the
        # most recently modified one. To keep two files rather than one,
        # remove `least` from `lst` and select again.
        least = min(lst, key=lambda x: x['last_modified'])
        if least['name'] != name:
            continue

    # walk up through the parent directories of the current path; if
    # any ancestor directory directly contains a file (it is a key in
    # `paths`), a shallower file exists on this branch, so break and
    # move on to the next item in `data`
    while path:
        dirname = os.path.dirname(path)
        if dirname in paths:
            break
        path = dirname
    # the `else` of a `while` runs only when the loop ends without
    # `break`, meaning no shallower file was found, so the full
    # path name is appended to the results list
    else:
        results.append({'name': name})

print(results)

OUTPUT

[
  {'name': 'root/folder1/f1/s1.csv'}, 
  {'name': 'root/folder2/f2/s2/file.csv'}, 
  {'name': 'root/folder2/f_1/f_2/f_3/file.csv'}, 
  {'name': 'root/folder2/f3/s3/file.csv'}, 
  {'name': 'root/folder3/f3/s3/s4/file4.csv'}
]
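
The same idea can also be packaged as a function. What follows is only a minimal sketch, not part of the code above: the name `shallowest_files` is hypothetical, it swaps `os.path` for `pathlib.PurePosixPath`, and it assumes that when several files share a directory the most recently modified one should win (as in the question's own attempt, whereas the code above uses `min()`).

```python
from datetime import datetime
from pathlib import PurePosixPath


def shallowest_files(entries):
    """Return the shallowest file on each branch, one file per directory."""
    # Keep one entry per directory: the most recently modified one.
    # (Assumption: "newest wins", taken from the question's attempt.)
    newest = {}
    for entry in entries:
        directory = PurePosixPath(entry["name"]).parent
        kept = newest.get(directory)
        if kept is None or entry["last_modified"] > kept["last_modified"]:
            newest[directory] = entry

    # Drop a directory's file when any ancestor directory also holds a file.
    return [
        {"name": entry["name"]}
        for directory, entry in newest.items()
        if not any(ancestor in newest for ancestor in directory.parents)
    ]


ts = datetime(2022, 8, 4, 18, 43, 13)
sample = [
    {"name": "root/folder1/f1/s1.csv", "last_modified": ts},
    {"name": "root/folder2/f2/s2/file.csv", "last_modified": ts},
    {"name": "root/folder2/f_1/f_2/f_3/file.csv", "last_modified": ts},
    {"name": "root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv", "last_modified": ts},
    {"name": "root/folder2/f3/s3/file.csv", "last_modified": ts},
    {"name": "root/folder3/f3/s3/s4/file4.csv", "last_modified": ts},
    {"name": "root/folder3/f3/s3/s4/s5/s6/file4.csv", "last_modified": ts},
]

result = shallowest_files(sample)
for row in result:
    print(row["name"])
```

Storing the directories as dictionary keys makes each ancestor lookup a constant-time check, and `directory.parents` replaces the manual `while` loop over `os.path.dirname`. On the sample data this keeps the same five entries as the output listed above.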
