dsexplorer

Reputation: 105

Iterate through folders and find a file to put into a dataframe

I have a directory ../customer_data/* with 15 folders. Each folder is a unique customer.

Example: ../customer_data/customer_1

Within each customer folder there is a csv called surveys.csv.

GOAL: I want to iterate through all the folders in ../customer_data/*, read the surveys.csv for each customer, and concatenate them into a single dataframe. I also want to add a column holding the customer id, which is the name of the folder the file came from.

import glob
import os
rootdir = '../customer_data/*'
dataframes = []
for subdir, dirs, files in os.walk(rootdir):
    
    for file in files:
        csvfiles = glob.glob(os.path.join(rootdir, 'surveys.csv'))
        
        # loop through the files and read them in with pandas
         # a list to hold all the individual pandas DataFrames
      
        df = pd.read_csv(csvfiles)
        df['customer_id'] = os.path.dirname
        dataframes.append(df)
            
# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
result.head()

This code is not giving me all 15 files. Please help.

Upvotes: 0

Views: 1859

Answers (2)

Umar.H

Reputation: 23099

Let's try pathlib with rglob, which recursively searches your directory structure for all files that match a glob pattern, in this instance surveys.csv.

import pandas as pd 
from pathlib import Path

root_dir = Path('/top_level_dir/')

# map each customer folder name to its surveys.csv (rglob must be called on the Path instance)
files = {file.parent.name: file for file in root_dir.rglob('surveys.csv')}

df = pd.concat([pd.read_csv(file).assign(customer=name) for name,file in files.items()])

Note you'll need Python 3.4+ for pathlib.
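
To illustrate why the loop in the question likely comes up empty while rglob finds the files, here is a minimal sketch. It assumes a throwaway directory tree built with tempfile; the folder names and CSV contents are made up for the demonstration. os.walk does not expand wildcards, so a root ending in `*` is treated as a literal folder name that does not exist.

```python
import os
import tempfile
from pathlib import Path

# Build a throwaway tree: <tmp>/customer_data/customer_N/surveys.csv (hypothetical data).
root = Path(tempfile.mkdtemp()) / "customer_data"
for i in (1, 2, 3):
    folder = root / f"customer_{i}"
    folder.mkdir(parents=True)
    (folder / "surveys.csv").write_text("question,score\nq1,5\n")

# os.walk treats "*" as a literal directory name, so it silently yields nothing.
walked = list(os.walk(str(root / "*")))

# rglob does the pattern matching itself and descends into every subfolder.
found = sorted(p.parent.name for p in root.rglob("surveys.csv"))

print(len(walked))  # 0
print(found)        # ['customer_1', 'customer_2', 'customer_3']
```

So the wildcard belongs in the glob pattern, not in the root path handed to os.walk.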

Upvotes: 0

jkr

Reputation: 19250

You can use the pathlib module for this.

from pathlib import Path
import pandas as pd

dfs = []
for filepath in Path("customer_data").glob("customer_*/surveys.csv"):
    this_df = pd.read_csv(filepath)
    # Set the customer ID as the name of the parent directory.
    this_df.loc[:, "customer_id"] = filepath.parent.name
    dfs.append(this_df)

df = pd.concat(dfs)
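
If you want to check the approach before pointing it at the real data, here is a self-contained sketch of the same pattern. The sample folders, the score column and the temporary directory are assumptions made up for the test; ignore_index=True is an optional extra that renumbers the rows, as in the question's own pd.concat call.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data: two customer folders, each with a small surveys.csv.
root = Path(tempfile.mkdtemp()) / "customer_data"
for name, scores in {"customer_1": [5, 4], "customer_2": [3]}.items():
    folder = root / name
    folder.mkdir(parents=True)
    pd.DataFrame({"score": scores}).to_csv(folder / "surveys.csv", index=False)

frames = []
for filepath in sorted(root.glob("customer_*/surveys.csv")):
    frame = pd.read_csv(filepath)
    # The customer ID is the name of the folder that holds the file.
    frame["customer_id"] = filepath.parent.name
    frames.append(frame)

# ignore_index=True renumbers rows 0..n-1 instead of repeating each file's own index.
result = pd.concat(frames, ignore_index=True)
print(result)
```

On the real directory, result["customer_id"].nunique() should come out as 15 if every customer folder was picked up.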

Upvotes: 1
