Reputation: 875
I have 11 files in my directory that starts with 'case-export-'. For each of these files, I want to extract only 5 columns as indicated in the 'initial_columns' variable, save the results to a dataframe called 'whole_file', and write the 'whole_file' dataframe to a CSV. Because I have 11 files that matches 'case-export-' in my directory, I should have 11 CSVs generated by my script.
When I run the scripts below, I only got 1 CSV generated, and that CSV has data from the most recent file read by the For loop.
date = dt.datetime.today().strftime("%m_%d_%Y")
directory = '/Users/myname/Downloads/'
initial_columns = ['Case #','Case Subject','Created At','Labels','Body']
for file in (glob.glob(directory + 'case-export-*')):
whole_file=pd.read_csv(file, usecols = initial_columns, encoding='ISO-8859-1', index_col=None, low_memory=False).replace(np.nan, 'blank', regex=True)
whole_file.to_csv(directory + 'case_export_trimmed_' + date + '_' + str(now.hour) + '_' + str(now.minute) + '_' + str(now.second) + '.csv')
I know how to get around this by adding an empty dataframe and then concatenating each 'whole_file' to the dataframe as it goes through the loop. I tried that and it worked perfectly. However, I am trying to understand why my original script that I shown above did not work as expected and why it only produced 1 CSV instead of 11. Please advise.
Upvotes: 0
Views: 1796
Reputation: 375925
Generally, a better solution that to use date (hour/min/second), is to keep the old file name but either prefix or suffix it with trimmed:
for file in (glob.glob(directory + 'case-export-*')):
whole_file = pd.read_csv(file, usecols=initial_columns, encoding='ISO-8859-1', index_col=None, low_memory=False).replace(np.nan, 'blank', regex=True)
trimmed_file = file.replace("case-export-", "case-export-trimmed-")
# or trimmed_file = file.replace(".csv", "-trimmed.csv")
whole_file.to_csv(trimmed_file)
That way you have a clearer record/understanding from looking at the filenames what happened/the intention was...
Upvotes: 0
Reputation: 2755
I think the reason for this is that now is defined outside the for loop you have. Therefore every time you use it is is using the the same hour, minute and second in the csv file name and as a result you are overwriting the one file each time.
If you move the now variable definition to inside your for loop it should work:
for file in (glob.glob(directory + 'case-export-*')):
now = datetime.datetime.now()
date = dt.datetime.today().strftime("%m_%d_%Y")
whole_file=pd.read_csv(file, usecols = initial_columns, encoding='ISO-8859-1', index_col=None, low_memory=False).replace(np.nan, 'blank', regex=True)
whole_file.to_csv(directory + 'case_export_trimmed_' + date + '_' + str(now.hour) + '_' + str(now.minute) + '_' + str(now.second) + '.csv')
*EDIT: Also moved date definition inside your for loop to get accurate date
Also I would do the following to generate the file name:
for file in (glob.glob(directory + 'case-export-*')):
csv_file_name = df.datetime.now().strftime("%m_%d_%Y_%H_%M_%S")
whole_file=pd.read_csv(file, usecols = initial_columns, encoding='ISO-8859-1', index_col=None, low_memory=False).replace(np.nan, 'blank', regex=True)
whole_file.to_csv(directory + 'case_export_trimmed_' + csv_file_name + '.csv')
Upvotes: 1
Reputation: 762
It seems you might have declared the 'date' and 'now' variable outside of the loop, which means
'whole_file.to_csv(directory + 'case_export_trimmed_' + date + '' + str(now.hour) + '' + str(now.minute) + '_' + str(now.second) + '.csv')'
will be over writing to the same file for each of the 11 iterations, hence you will see only the 11th iteration details in the output file
Upvotes: 0