Reputation: 113
Hi, I have data like this:
import datetime

data = [
    {'name': 'root/folder1/f1/s1.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder2/f2/s2/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder2/f_1/f_2/f_3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder2/f3/s3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder3/f3/s3/s4/file4.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder3/f3/s3/s4/s5/s6/file4.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
]
I want to get the files in each folder that have the shortest path. For example:
In folder1 there is only one file, so it should come through unchanged.
In folder2 two nested paths contain a file: root/folder2/f_1/f_2/f_3 and root/folder2/f_1/f_2/f_3/f_4/f_5. Here I want to keep only the shorter one. A third path also exists in folder2, 'root/folder2/f3/s3/file.csv', but it should come through as it is.
In folder3 I likewise want only the file with the shortest path, root/folder3/f3/s3/s4/file4.csv.
Expected output:
data = [
    {'name': 'root/folder1/f1/s1.csv'},
    {'name': 'root/folder2/f2/s2/file.csv'},
    {'name': 'root/folder2/f_1/f_2/f_3/file.csv'},
    {'name': 'root/folder2/f3/s3/file.csv'},
    {'name': 'root/folder3/f3/s3/s4/file4.csv'},
]
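One way to read the expected output (an interpretation, not something the data itself states) is: drop a file when some other file's folder is a proper ancestor of its own folder. Below is a minimal sketch of that rule using `pathlib.PurePosixPath`; the function name `shallowest_files` is made up for illustration, and it works on the bare name strings rather than the full dicts.

```python
from pathlib import PurePosixPath


def shallowest_files(names):
    """Keep a name only if no other file's folder is a proper ancestor of its folder."""
    # Every folder that directly contains at least one file.
    folders = {PurePosixPath(n).parent for n in names}
    kept = []
    for n in names:
        folder = PurePosixPath(n).parent
        # `folder.parents` yields each proper ancestor of `folder`.
        if not any(p in folders for p in folder.parents):
            kept.append(n)
    return kept


names = [
    'root/folder1/f1/s1.csv',
    'root/folder2/f2/s2/file.csv',
    'root/folder2/f_1/f_2/f_3/file.csv',
    'root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv',
    'root/folder2/f3/s3/file.csv',
    'root/folder3/f3/s3/s4/file4.csv',
    'root/folder3/f3/s3/s4/s5/s6/file4.csv',
]
for kept_name in shallowest_files(names):
    print(kept_name)
```

With the sample names this keeps the five paths listed in the expected output and drops the two deeper copies.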
What I have tried so far: I am trying to pick the paths with the fewest slashes, but I am not sure how to check each subfolder. For example, I did this:
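import os

data_dict = {}
for item in data:
    # Group by the directory part of the path (renamed from `dir`,
    # which shadows the built-in of the same name).
    dir_name = os.path.dirname(item['name'])
    if dir_name not in data_dict:
        item['count'] = 1
        data_dict[dir_name] = item
    else:
        count = data_dict[dir_name]['count'] + 1
        # Keep the most recently modified file for this directory.
        if item['last_modified'] > data_dict[dir_name]['last_modified']:
            data_dict[dir_name] = item
        data_dict[dir_name]['count'] = count
result = list(data_dict.values())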
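To illustrate why counting slashes alone falls short, here is a small sketch that keeps, for each top-level folder, only the names with the fewest slashes. The helper name `min_slash_per_top_folder` is hypothetical, and it assumes every name has the form root/<folder>/... as in the sample data.

```python
from collections import defaultdict


def min_slash_per_top_folder(names):
    """Keep only the names with the fewest slashes in each top-level folder."""
    groups = defaultdict(list)
    for n in names:
        top = n.split('/')[1]  # e.g. 'folder2' (assumes 'root/<folder>/...')
        groups[top].append(n)
    kept = []
    for members in groups.values():
        fewest = min(m.count('/') for m in members)
        kept.extend(m for m in members if m.count('/') == fewest)
    return kept


folder2_names = [
    'root/folder2/f2/s2/file.csv',
    'root/folder2/f_1/f_2/f_3/file.csv',
    'root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv',
    'root/folder2/f3/s3/file.csv',
]
for kept_name in min_slash_per_top_folder(folder2_names):
    print(kept_name)
```

For folder2 this keeps only the f2/s2 and f3/s3 files and drops 'root/folder2/f_1/f_2/f_3/file.csv', which the expected output wants to keep. So the depth comparison has to be made along each branch of nested folders, not across the whole top-level folder.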
Upvotes: 2
Views: 135
Reputation: 17365
Something like this would probably work.
import os
import datetime
from collections import Counter
data = [
    {'name': 'root/folder1/f1/s1.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder2/f2/s2/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder2/f_1/f_2/f_3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder2/f3/s3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder3/f3/s3/s4/file4.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {'name': 'root/folder3/f3/s3/s4/s5/s6/file4.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
]
results = []
# Count how many files sit directly in each directory (the path minus
# the file name). A count above 1 means several files share that
# directory, so they are filtered by timestamp below.
paths = Counter(os.path.dirname(i['name']) for i in data)
for row in data:
    name = row["name"]
    path, filename = os.path.split(name)  # split the path from the file name
    # If several files share this directory, keep only the one with the
    # earliest `last_modified` (use max() instead of min() to keep the
    # most recent one) and skip the others.
    # If you want to keep 2 files instead of 1 >>>
    if paths[path] > 1:
        # `lst` holds the files that sit directly in this same directory
        # (compared by exact directory, not by string prefix)
        lst = [i for i in data if os.path.dirname(i['name']) == path]
        # >>> you would remove the first result from `lst`, run this
        # next line again, and let the script continue for both of
        # the collected files.
        least = min(lst, key=lambda x: x['last_modified'])
        if least['name'] != name:
            continue
    # Walk up through each parent directory. If a parent is itself a
    # directory that holds a file, a shallower copy exists, so break
    # and move on to the next item in `data` >>>
    while path:
        dirname = os.path.dirname(path)
        if dirname in paths:
            break
        path = dirname
    # >>> otherwise the loop finishes without a break, which means this
    # is the shallowest copy, so the `else` of the `while` appends the
    # full path name to the results list.
    else:
        results.append({'name': name})
print(results)
OUTPUT
[
{'name': 'root/folder1/f1/s1.csv'},
{'name': 'root/folder2/f2/s2/file.csv'},
{'name': 'root/folder2/f_1/f_2/f_3/file.csv'},
{'name': 'root/folder2/f3/s3/file.csv'},
{'name': 'root/folder3/f3/s3/s4/file4.csv'}
]
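The loop above relies on Python's `while ... else`, where the `else` branch runs only if the loop ends without hitting `break`. Here is a minimal standalone sketch of that pattern; the helper name `has_listed_ancestor` is made up for illustration, and it splits on '/' directly instead of using `os.path`.

```python
def has_listed_ancestor(path, listed):
    """Return True if any proper ancestor of `path` is in `listed`."""
    while path:
        # Drop the last component: 'a/b/c' -> 'a/b' -> 'a' -> ''.
        path = path.rpartition('/')[0]
        if path in listed:
            break  # found an ancestor, so the `else` below is skipped
    else:
        # Reached only when the loop ran out of components without a break.
        return False
    return True


listed = {'root/folder2/f_1/f_2/f_3'}
print(has_listed_ancestor('root/folder2/f_1/f_2/f_3/f_4/f_5', listed))
print(has_listed_ancestor('root/folder2/f3/s3', listed))
```

The first call finds root/folder2/f_1/f_2/f_3 on the way up and breaks; the second walks all the way to the empty string, so the `else` branch runs.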
Upvotes: 2