Extract filepaths from pandas DataFrame python

Question

I have an Excel file which contains filepaths to folders in a column. There could be several filepaths stored in a row. I can read the excel file into pandas like this.

Now what I want to do is iterate through my pandas DataFrame df row by row and extract the directories stored so I can use them as input directory for other functions.

If I access the rows in the dataframe with iloc I get a str like object when what I would like is to have each row of the type list so I can iterate through it.

An example of the format of the variables in my dataframe.

import pandas as pd

path_1 = '[\'C:\\tmp_patients\\Pat_MAV_BE_B01_\']'
path_2 =  '[\'C:\\tmp_patients\\Pat_MAV_B16\', \'C:\\tmp_patients\\Pat_MAV_BE_B16_2017-06-30_08-49-28\']'
d = {'col1': [path_1, path_2]}
df = pd.DataFrame(data=d)
#or read directly excel 
# df= pd.read_excel(filepath_to_excel)


for idx in range(len(df)):
    paths = df['col1'].iloc[idx]
    for a_single_path in paths:
        print(a_single_path)
        # todo: process all the files found at the location "a single path" with os.walk

How the data looks after i read the file with pd.read_excel()

Trenton McKinney · Accepted Answer

If you want rows of individual directories:

Data:

Note the column name being used is file_path_lists, but the name of the column in the question screenshot is col1

from pathlib import Path
from ast import literal_eval

df = pd.read_excel('test.xlsx')

Convert rows from `str` to `list` and `explode` each `list` into a separate row:

df.file_path_lists = df.file_path_lists.apply(literal_eval)
df2 = pd.DataFrame(df.explode('file_path_lists'))
df2.dropna(inplace=True)

print(df2.file_path_lists[0])
>>> 'C:\tmp_patients\Pat_MAV_BE_B01_'

Note the path is still a str

Convert to `pathlib` objects:

pathlib part of the standard library and should be used instead of os. Python 3's pathlib Module: Taming the File System

df2.file_path_lists = df2.file_path_lists.apply(Path)
print(df2.file_path_lists[0])
>>> WindowsPath('C:/tmp_patients/Pat_MAV_BE_B01_')

Now each is a pathlib object.

Access each directory:

for dir in df2.file_path_lists:
    print(dir)
    print(type(dir))

>>> C:	mp_patients\Pat_MAV_BE_B01_
    

    C:	mp_patients\Pat_MAV_B16
    

    C:	mp_patients\Pat_MAV_BE_B16_2017-06-30_08-49-28

Print a list of files found in a patient directory:

for dir in df2.file_path_lists:
    patient_files = list(dir.glob('*.*'))  # use .rglob if there are subdirs
    print(patient_files)

If you want rows of `lists` instead of a row of each directory:

skip .explode

df = pd.read_excel('test.xlsx')
df.file_path_lists = df.file_path_lists.apply(literal_eval)

print(type(df.file_path_lists[0]))
>>> list

for row in df.file_path_lists:  # iterate the row
    for x in row:  # iterate the list inside the row
        print(x)

>>> C:	mp_patients\Pat_MAV_BE_B01_
    C:	mp_patients\Pat_MAV_B16
    C:	mp_patients\Pat_MAV_BE_B16_2017-06-30_08-49-28

Extract filepaths from pandas DataFrame python

Answers (2)

If you want rows of individual directories:

Data:

Convert rows from `str` to `list` and `explode` each `list` into a separate row:

Convert to `pathlib` objects:

Access each directory:

Print a list of files found in a patient directory:

If you want rows of `lists` instead of a row of each directory:

Related Questions

Extract filepaths from pandas DataFrame python

Answers (2)

If you want rows of individual directories:

Data:

Convert rows from str to list and explode each list into a separate row:

Convert to pathlib objects:

Access each directory:

Print a list of files found in a patient directory:

If you want rows of lists instead of a row of each directory:

Related Questions

Convert rows from `str` to `list` and `explode` each `list` into a separate row:

Convert to `pathlib` objects:

If you want rows of `lists` instead of a row of each directory: