Reputation: 557
I have a zip file that contains multiple dated folders. Each folder has a datestamp.txt, which holds the date, and multiple CSV files.
For example:
In the Archives.zip: \Folder1 \Folder2
In each folder:
DATESTAMP.txt
a.csv
b.csv
This zip file is dropped from upstream and contains multiple days of data. The date is stored in each folder's datestamp.txt file (just a datestamp like 20200903). How can I process only the CSV files for the latest date? (For example, Folder1/datestamp.txt: 20200903 and Folder2/datestamp.txt: 20200904, so I only want Folder2's CSV files.)
I tried to read the date from the txt file first and sort them.
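import pandas as pd
from zipfile import ZipFile
# forward slash avoids the invalid '\A' escape in the path string
zip_file = ZipFile('data/Archives.zip')
# datestamp files, keyed by their path inside the archive
timestamp = {text_file.filename: pd.read_csv(zip_file.open(text_file.filename), header=None)
             for text_file in zip_file.infolist() if text_file.filename.endswith('.txt')}
# every CSV in the archive, regardless of date
dfs = {text_file.filename: pd.read_csv(zip_file.open(text_file.filename))
       for text_file in zip_file.infolist() if text_file.filename.endswith('.csv')}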
Is there a way to get the date directly from datestamp.txt and read only the latest a.csv and b.csv?
Thank you
Upvotes: 0
Views: 156
Reputation: 3001
Here is a way to find the latest date and its corresponding folder. I used a defaultdict so that the result lists every folder if more than one has the latest date.
from collections import defaultdict
# create test data
metadata = [
'Folder1/datestamp.txt: 20200903', # Sept 3
'Folder2/datestamp.txt: 20200904',
'Folder2/datestamp.txt: 20200903', # Sept 3 also (impossible?)
]
# initial value is empty list; just append without checking first
latest = defaultdict(list)
for m in metadata:
    folder = m.split('/', 1)[0]
    datestamp = m.rsplit(' ', 1)[-1]
    latest[datestamp].append(folder)
print('max date :', max(latest))
print('folder(s) :', latest[max(latest)])
max date : 20200904
folder(s) : ['Folder2']
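The same idea can be applied to the archive itself rather than to hard-coded test strings. The following is only a sketch under a few assumptions: the helper name latest_csv_members is hypothetical, each folder holds exactly one datestamp file containing an ASCII date in YYYYMMDD form, and the in-memory archive is made-up test data that mimics the layout in the question. If two folders share the latest date, this version returns only the first one it finds.

```python
import io
import posixpath
from zipfile import ZipFile


def latest_csv_members(zf):
    """Return (latest datestamp, sorted CSV member names) for the newest folder.

    Hypothetical helper; assumes one datestamp.txt per folder.
    """
    # Skip explicit directory entries, which end with '/'.
    names = [n for n in zf.namelist() if not n.endswith('/')]

    # Map each folder to the date stored in its datestamp file.
    dates = {}
    for name in names:
        folder, base = posixpath.split(name)
        if base.lower() == 'datestamp.txt':
            dates[folder] = zf.read(name).decode('ascii').strip()

    # YYYYMMDD strings sort chronologically, so comparing the strings is enough.
    latest_folder = max(dates, key=dates.get)

    csvs = sorted(n for n in names
                  if posixpath.dirname(n) == latest_folder
                  and n.lower().endswith('.csv'))
    return dates[latest_folder], csvs


# Made-up in-memory archive with the same layout as in the question.
buf = io.BytesIO()
with ZipFile(buf, 'w') as zf:
    zf.writestr('Folder1/DATESTAMP.txt', '20200903\n')
    zf.writestr('Folder1/a.csv', 'x,y\n1,2\n')
    zf.writestr('Folder1/b.csv', 'x,y\n3,4\n')
    zf.writestr('Folder2/DATESTAMP.txt', '20200904\n')
    zf.writestr('Folder2/a.csv', 'x,y\n5,6\n')
    zf.writestr('Folder2/b.csv', 'x,y\n7,8\n')

with ZipFile(buf) as zf:
    latest_date, csv_names = latest_csv_members(zf)

print(latest_date)
print(csv_names)
```

With the real Archives.zip, each returned name could then be passed to pd.read_csv(zip_file.open(name)) as in the question, so only the latest folder's a.csv and b.csv are loaded.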
Upvotes: 1