RMS
RMS

Reputation: 1430

Extract filepaths from pandas DataFrame python

I have an Excel file which contains filepaths to folders in a column. There could be several filepaths stored in a row. I can read the excel file into pandas like this.

Now what I want to do is iterate through my pandas DataFrame df row by row and extract the directories stored so I can use them as input directory for other functions.

If I access the rows in the dataframe with iloc I get a str like object when what I would like is to have each row of the type list so I can iterate through it.

An example of the format of the variables in my dataframe.

import pandas as pd

path_1 = '[\'C:\\\\tmp_patients\\\\Pat_MAV_BE_B01_\']'
path_2 =  '[\'C:\\\\tmp_patients\\\\Pat_MAV_B16\', \'C:\\\\tmp_patients\\\\Pat_MAV_BE_B16_2017-06-30_08-49-28\']'
d = {'col1': [path_1, path_2]}
df = pd.DataFrame(data=d)
#or read directly excel 
# df= pd.read_excel(filepath_to_excel)


for idx in range(len(df)):
    paths = df['col1'].iloc[idx]
    for a_single_path in paths:
        print(a_single_path)
        # todo: process all the files found at the location "a single path" with os.walk

How the data looks after i read the file with pd.read_excel() df

Upvotes: 4

Views: 4639

Answers (2)

Trenton McKinney
Trenton McKinney

Reputation: 62413

If you want rows of individual directories:

Data:

enter image description here

  • Note the column name being used is file_path_lists, but the name of the column in the question screenshot is col1
from pathlib import Path
from ast import literal_eval

df = pd.read_excel('test.xlsx')

enter image description here

Convert rows from str to list and explode each list into a separate row:

df.file_path_lists = df.file_path_lists.apply(literal_eval)
df2 = pd.DataFrame(df.explode('file_path_lists'))
df2.dropna(inplace=True)

enter image description here

print(df2.file_path_lists[0])
>>> 'C:\\tmp_patients\\Pat_MAV_BE_B01_'
  • Note the path is still a str

Convert to pathlib objects:

df2.file_path_lists = df2.file_path_lists.apply(Path)
print(df2.file_path_lists[0])
>>> WindowsPath('C:/tmp_patients/Pat_MAV_BE_B01_')
  • Now each is a pathlib object.

Access each directory:

for dir in df2.file_path_lists:
    print(dir)
    print(type(dir))

>>> C:\tmp_patients\Pat_MAV_BE_B01_
    <class 'pathlib.WindowsPath'>

    C:\tmp_patients\Pat_MAV_B16
    <class 'pathlib.WindowsPath'>

    C:\tmp_patients\Pat_MAV_BE_B16_2017-06-30_08-49-28
    <class 'pathlib.WindowsPath'>

Print a list of files found in a patient directory:

for dir in df2.file_path_lists:
    patient_files = list(dir.glob('*.*'))  # use .rglob if there are subdirs
    print(patient_files)

If you want rows of lists instead of a row of each directory:

  • skip .explode
df = pd.read_excel('test.xlsx')
df.file_path_lists = df.file_path_lists.apply(literal_eval)

print(type(df.file_path_lists[0]))
>>> list

for row in df.file_path_lists:  # iterate the row
    for x in row:  # iterate the list inside the row
        print(x)

>>> C:\tmp_patients\Pat_MAV_BE_B01_
    C:\tmp_patients\Pat_MAV_B16
    C:\tmp_patients\Pat_MAV_BE_B16_2017-06-30_08-49-28

Upvotes: 2

Code Different
Code Different

Reputation: 93161

Your sample input has strings that look like arrays. I think read_excel will NOT do that so you don't need the .apply(literal_eval) call below.

Assuming you are on pandas 0.25 or later so you can use explode:

from ast import literal_eval

path_1 = "['C:\\\\develop\\\\python-util-script\\\\Pat_MAV_B01']"
path_2 =  "['C:\\\\develop\\\\python-util-script\\\\Pat_MAV_B16', 'C:\\\\develop\\\\python-util-script\\\\Pat_MAV_BE_B16_2017-06-30_08-49-28']"
d = {'col1': [path_1, path_2]}
df = pd.DataFrame(data=d)

df['col1'].apply(literal_eval).explode()

Output:

0            C:\develop\python-util-script\Pat_MAV_B01
1            C:\develop\python-util-script\Pat_MAV_B16
1    C:\develop\python-util-script\Pat_MAV_BE_B16_2...
Name: col1, dtype: object

Upvotes: 0

Related Questions