Reputation: 1430
I have an Excel file which contains filepaths to folders in a column. There could be several filepaths stored in a row. I can read the excel file into pandas like this.
Now what I want to do is iterate through my pandas DataFrame df
row by row and extract the directories stored so I can use them as input directory for other functions.
If I access the rows in the dataframe with iloc I get a str
like object when what I would like is to have each row of the type list
so I can iterate through it.
An example of the format of the variables in my dataframe.
import pandas as pd
path_1 = '[\'C:\\\\tmp_patients\\\\Pat_MAV_BE_B01_\']'
path_2 = '[\'C:\\\\tmp_patients\\\\Pat_MAV_B16\', \'C:\\\\tmp_patients\\\\Pat_MAV_BE_B16_2017-06-30_08-49-28\']'
d = {'col1': [path_1, path_2]}
df = pd.DataFrame(data=d)
#or read directly excel
# df= pd.read_excel(filepath_to_excel)
for idx in range(len(df)):
paths = df['col1'].iloc[idx]
for a_single_path in paths:
print(a_single_path)
# todo: process all the files found at the location "a single path" with os.walk
How the data looks after i read the file with pd.read_excel()
Upvotes: 4
Views: 4639
Reputation: 62413
file_path_lists
, but the name of the column in the question screenshot is col1
from pathlib import Path
from ast import literal_eval
df = pd.read_excel('test.xlsx')
str
to list
and explode
each list
into a separate row:df.file_path_lists = df.file_path_lists.apply(literal_eval)
df2 = pd.DataFrame(df.explode('file_path_lists'))
df2.dropna(inplace=True)
print(df2.file_path_lists[0])
>>> 'C:\\tmp_patients\\Pat_MAV_BE_B01_'
str
pathlib
objects:os
. Python 3's pathlib Module: Taming the File Systemdf2.file_path_lists = df2.file_path_lists.apply(Path)
print(df2.file_path_lists[0])
>>> WindowsPath('C:/tmp_patients/Pat_MAV_BE_B01_')
pathlib
object.for dir in df2.file_path_lists:
print(dir)
print(type(dir))
>>> C:\tmp_patients\Pat_MAV_BE_B01_
<class 'pathlib.WindowsPath'>
C:\tmp_patients\Pat_MAV_B16
<class 'pathlib.WindowsPath'>
C:\tmp_patients\Pat_MAV_BE_B16_2017-06-30_08-49-28
<class 'pathlib.WindowsPath'>
for dir in df2.file_path_lists:
patient_files = list(dir.glob('*.*')) # use .rglob if there are subdirs
print(patient_files)
lists
instead of a row of each directory:.explode
df = pd.read_excel('test.xlsx')
df.file_path_lists = df.file_path_lists.apply(literal_eval)
print(type(df.file_path_lists[0]))
>>> list
for row in df.file_path_lists: # iterate the row
for x in row: # iterate the list inside the row
print(x)
>>> C:\tmp_patients\Pat_MAV_BE_B01_
C:\tmp_patients\Pat_MAV_B16
C:\tmp_patients\Pat_MAV_BE_B16_2017-06-30_08-49-28
Upvotes: 2
Reputation: 93161
Your sample input has strings that look like arrays. I think read_excel
will NOT do that so you don't need the .apply(literal_eval)
call below.
Assuming you are on pandas 0.25 or later so you can use explode
:
from ast import literal_eval
path_1 = "['C:\\\\develop\\\\python-util-script\\\\Pat_MAV_B01']"
path_2 = "['C:\\\\develop\\\\python-util-script\\\\Pat_MAV_B16', 'C:\\\\develop\\\\python-util-script\\\\Pat_MAV_BE_B16_2017-06-30_08-49-28']"
d = {'col1': [path_1, path_2]}
df = pd.DataFrame(data=d)
df['col1'].apply(literal_eval).explode()
Output:
0 C:\develop\python-util-script\Pat_MAV_B01
1 C:\develop\python-util-script\Pat_MAV_B16
1 C:\develop\python-util-script\Pat_MAV_BE_B16_2...
Name: col1, dtype: object
Upvotes: 0