Stanleyrr
Stanleyrr

Reputation: 875

to_csv command in For loop not working as expected

I have 11 files in my directory that starts with 'case-export-'. For each of these files, I want to extract only 5 columns as indicated in the 'initial_columns' variable, save the results to a dataframe called 'whole_file', and write the 'whole_file' dataframe to a CSV. Because I have 11 files that matches 'case-export-' in my directory, I should have 11 CSVs generated by my script.

When I run the scripts below, I only got 1 CSV generated, and that CSV has data from the most recent file read by the For loop.

date = dt.datetime.today().strftime("%m_%d_%Y")
directory = '/Users/myname/Downloads/'
initial_columns = ['Case #','Case Subject','Created At','Labels','Body'] 

for file in (glob.glob(directory + 'case-export-*')):
    whole_file=pd.read_csv(file, usecols = initial_columns, encoding='ISO-8859-1', index_col=None, low_memory=False).replace(np.nan, 'blank', regex=True)   
    whole_file.to_csv(directory + 'case_export_trimmed_' + date + '_' + str(now.hour) + '_' + str(now.minute) + '_' + str(now.second) + '.csv')

I know how to get around this by adding an empty dataframe and then concatenating each 'whole_file' to the dataframe as it goes through the loop. I tried that and it worked perfectly. However, I am trying to understand why my original script that I shown above did not work as expected and why it only produced 1 CSV instead of 11. Please advise.

Upvotes: 0

Views: 1796

Answers (3)

Andy Hayden
Andy Hayden

Reputation: 375925

Generally, a better solution that to use date (hour/min/second), is to keep the old file name but either prefix or suffix it with trimmed:

for file in (glob.glob(directory + 'case-export-*')):
    whole_file = pd.read_csv(file, usecols=initial_columns, encoding='ISO-8859-1', index_col=None, low_memory=False).replace(np.nan, 'blank', regex=True)
    trimmed_file = file.replace("case-export-", "case-export-trimmed-")   
    # or trimmed_file = file.replace(".csv", "-trimmed.csv")   
    whole_file.to_csv(trimmed_file)

That way you have a clearer record/understanding from looking at the filenames what happened/the intention was...

Upvotes: 0

cullzie
cullzie

Reputation: 2755

I think the reason for this is that now is defined outside the for loop you have. Therefore every time you use it is is using the the same hour, minute and second in the csv file name and as a result you are overwriting the one file each time.

If you move the now variable definition to inside your for loop it should work:

for file in (glob.glob(directory + 'case-export-*')):
    now = datetime.datetime.now()
    date = dt.datetime.today().strftime("%m_%d_%Y")
    whole_file=pd.read_csv(file, usecols = initial_columns, encoding='ISO-8859-1', index_col=None, low_memory=False).replace(np.nan, 'blank', regex=True)   
    whole_file.to_csv(directory + 'case_export_trimmed_' + date + '_' + str(now.hour) + '_' + str(now.minute) + '_' + str(now.second) + '.csv')

*EDIT: Also moved date definition inside your for loop to get accurate date

Also I would do the following to generate the file name:

for file in (glob.glob(directory + 'case-export-*')):
    csv_file_name = df.datetime.now().strftime("%m_%d_%Y_%H_%M_%S")
    whole_file=pd.read_csv(file, usecols = initial_columns, encoding='ISO-8859-1', index_col=None, low_memory=False).replace(np.nan, 'blank', regex=True)   
    whole_file.to_csv(directory + 'case_export_trimmed_' + csv_file_name + '.csv')

Upvotes: 1

Jithin P James
Jithin P James

Reputation: 762

It seems you might have declared the 'date' and 'now' variable outside of the loop, which means

'whole_file.to_csv(directory + 'case_export_trimmed_' + date + '' + str(now.hour) + '' + str(now.minute) + '_' + str(now.second) + '.csv')'

will be over writing to the same file for each of the 11 iterations, hence you will see only the 11th iteration details in the output file

Upvotes: 0

Related Questions