Code works on individual files, but data gets jumbled when looping through directory

Question

I have one main dataframe that contains data for ~25 people with 10 trials per person. I also have individual files for each trial by participant. My goal is to have one file per participant that contains all 10 trials with data from both the main dataframe and the individual files.

I am matching the data in the main dataframe to the files in the directory by filename. The filename contains both the participant ID and the trial number, e.g. 90-9.csv.

Example of the data:

# Main df:
ID    trial  file       length
90    9      90-9.csv   56
90    10     90-10.csv  44
91    1      91-1.csv   62
91    2      91-2.csv   48

# Individual files in directory- these files contain diameter data:
90-9.csv
90-10.csv
91-1.csv
91-2.csv

# intended output:
ID  trial  file      length  diameter
..  ..     ..        ..      ..
90  8      90-8.csv  62      3.15
90  9      90-9.csv  56      3.17
90  10     90-10.csv 44      3.14

I have tried the following for looping through the directory:

directory = os.chdir(r'filepath')

# create list of files
dir_list = os.listdir(directory)

for file in dir_list:
    df = pd.read_csv('mainDF.csv')
    # create filename column in main df
    df['ID'] = df['ID'].astype(str)
    df['trial'] = df['trial'].astype(str)
    df['file'] = df['ID']+'-'+df['trial']+'.csv'
    
# this doesn't work
    for file in zoom['filename']:
          pupil = pd.read_csv(file)

# this one doesn't organize the data properly
     if ([x in file for x in zoom['filename']]):
         pupil = pd.read_csv(file)

When I isolate one participant and one trial number, the data is organized the way I show in the intended output. When I loop through the directory, everything becomes out of order. I'm not sure what's going on.

ticster · Accepted Answer

os.listdir lists the names of the files (and folders) in the directory you give it, but only the names. To open one of those files you need its full path, i.e. directory+"/"+file, or better, os.path.join(directory, file). In addition, you're reloading mainDF.csv on every iteration. You almost certainly don't want that; load it once, before the for loop starts. You also seem to be iterating over your files twice, and the inner loop reuses the name file, which overwrites the outer loop variable. Finally, zoom is never defined in the code you posted, so I can't tell what looping over it is meant to achieve.
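To make the path point concrete, here is a minimal sketch, assuming all the trial files sit in one folder. The helper name list_trial_paths and the throwaway demo folder are made up for illustration; os.listdir and os.path.join are the standard library calls.

```python
import os
import tempfile


def list_trial_paths(directory):
    """Return full paths of the .csv files in `directory`, sorted by name."""
    paths = []
    for name in sorted(os.listdir(directory)):
        # os.listdir returns bare names such as '90-9.csv', so join each one
        # with the directory to get a path that can actually be opened
        if name.endswith(".csv"):
            paths.append(os.path.join(directory, name))
    return paths


# Demo on a throwaway folder (hypothetical files, not your real data)
demo_dir = tempfile.mkdtemp()
for name in ["90-9.csv", "90-10.csv", "notes.txt"]:
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write("diameter\n3.15\n")

trial_paths = list_trial_paths(demo_dir)
print([os.path.basename(p) for p in trial_paths])
```

Two things worth noticing. First, os.chdir returns None, so directory = os.chdir(...) leaves directory set to None; keep the folder path in its own variable instead. Second, sorting filenames as text puts 90-10.csv before 90-9.csv, so don't rely on listing order to line files up with rows of the main dataframe.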

I'd advise rewriting this with glob instead of listdir. It's safer because the pattern selects only the files you want, and it returns paths you can open directly, so you don't have to build them yourself. Something like this:

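import glob

import pandas as pd

# load the main dataframe once, outside the loop
df = pd.read_csv('mainDF.csv')
# build the filename column used to match rows to the files on disk
df['ID'] = df['ID'].astype(str)
df['trial'] = df['trial'].astype(str)
df['file'] = df['ID'] + '-' + df['trial'] + '.csv'

# glob returns paths matching the pattern, so each one can be read directly
for path in glob.glob("path/to/files/*.csv"):
    pupil_df = pd.read_csv(path)
    # do what you want with pupil_df here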
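One possible way to fill in the "do what you want with pupil_df" step, sketched under assumptions the question doesn't confirm: each trial file has a diameter column, and the output should be one file per participant named <ID>_combined.csv. The demo data, data_dir, out_dir and the output names are all placeholders. The idea is to tag each trial file's rows with the filename they came from and let merge do the matching, so the order in which glob returns the files no longer matters.

```python
import glob
import os
import tempfile

import pandas as pd

# Throwaway demo data standing in for mainDF.csv and the trial files
# (values are made up for illustration)
data_dir = tempfile.mkdtemp()
main_df = pd.DataFrame({
    "ID": [90, 90, 91],
    "trial": [9, 10, 1],
    "length": [56, 44, 62],
})
for name, diameter in [("90-9.csv", 3.17), ("90-10.csv", 3.14), ("91-1.csv", 3.20)]:
    trial_df = pd.DataFrame({"diameter": [diameter]})
    trial_df.to_csv(os.path.join(data_dir, name), index=False)

# Build the key column; ID and trial stay numeric so they still sort as numbers
main_df["file"] = main_df["ID"].astype(str) + "-" + main_df["trial"].astype(str) + ".csv"

# Read every trial file and record which file each row came from
pieces = []
for path in glob.glob(os.path.join(data_dir, "*.csv")):
    pupil_df = pd.read_csv(path)
    pupil_df["file"] = os.path.basename(path)
    pieces.append(pupil_df)
pupil_all = pd.concat(pieces, ignore_index=True)

# Match rows by filename rather than by position, then restore a numeric order
merged = main_df.merge(pupil_all, on="file", how="left")
merged = merged.sort_values(["ID", "trial"]).reset_index(drop=True)

# Write one combined file per participant into a separate output folder
out_dir = tempfile.mkdtemp()
for participant, group in merged.groupby("ID"):
    group.to_csv(os.path.join(out_dir, f"{participant}_combined.csv"), index=False)
```

Writing the output to a separate folder keeps the combined files from being picked up by the *.csv pattern on a later run. If a trial file holds many diameter samples rather than one, the same merge still works; each row of the main dataframe is simply repeated once per sample.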
