Pavan Yeddanapudi

Reputation: 61

Trying to match a filename in a directory and an element in a .csv file in python using pandas

I am trying to iterate through the .jpg files in a directory and match each filename against the names in a single column (image_name) of a .csv file.

import csv
import pandas as pd
import fnmatch
import os


imagenames=pd.read_csv('file.csv',header=0,usecols=['image_name'])
imnum=imagenames.shape[0]

for filename in os.listdir("directory"):
    for i in range(imnum):
        if imagenames.iloc[i] == filename:
            print(imagenames.iloc[i])

I get an error message: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Can anyone help me with the code?

Upvotes: 0

Views: 1020

Answers (2)

MaxU - stand with Ukraine

Reputation: 210832

I'd do it this way:

import os
import glob
import pandas as pd

mask = r'/path/to/*.jpg'
jpgs = [os.path.split(f)[1] for f in glob.glob(mask)]
# select the column as a Series (read_csv's squeeze= argument was removed in pandas 2.0)
imagenames = pd.read_csv('file.csv', usecols=['image_name'])['image_name']

print(imagenames[imagenames.isin(jpgs)])

Upvotes: 1

Peter Mularien

Reputation: 2638

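Although you don't include the line number, I assume the error is on the line imagenames.iloc[i] == filename. You're getting this error because imagenames.iloc[i] returns a Pandas Series object (representing a single row of the DataFrame), so the comparison produces a boolean Series rather than a single True or False, and the if statement cannot evaluate it.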

You could resolve this by replacing it with imagenames.iloc[i]['image_name'], but the resulting code would still have two nested loops and do a lot of unnecessary work.
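To make the cause concrete, here is a minimal sketch. The small in-memory DataFrame is a hypothetical stand-in for file.csv, and the variable names are only for illustration:

```python
import pandas as pd

# Hypothetical stand-in for file.csv (assumed contents, for illustration only)
imagenames = pd.DataFrame({"image_name": ["a.jpg", "b.jpg"]})

row = imagenames.iloc[0]          # a Series holding one row, not a string
comparison = row == "a.jpg"       # element-wise comparison: a boolean Series

try:
    bool(comparison)              # this is what `if` does implicitly
    raised = False
except ValueError:
    raised = True                 # "The truth value of a Series is ambiguous..."

# Selecting the column as well yields a plain string, which compares normally
scalar = imagenames.iloc[0]["image_name"]
is_match = scalar == "a.jpg"
```

With the scalar in hand the original if statement works, although the nested loops remain.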

Instead, I'd recommend refactoring with the following aim:

  • You have a list of filenames from the CSV
  • You have a list of filenames from the directory listing
  • You want the intersection of these two lists (i.e. filenames which appear in both)

There are several ways to do this, and you don't mention how large these lists are. Assuming they're relatively small, one approach that is more in line with Pandas' vectorized handling of data would be:

import os
import pandas as pd

imagenames = pd.read_csv('file.csv', header=0, usecols=['image_name'])
files_in_dir = os.listdir("directory")
matches = imagenames[imagenames['image_name'].isin(files_in_dir)]

This isn't super efficient, since .isin is searching through a list of files; if the list is quite long, it could be slow. You could consider using a set or another optimization if that applies to your situation.
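As a rough sketch of the set-based variant, using only the standard library: the helper name matching_names is hypothetical, and the temporary directory and the csv_names list are stand-ins for your real directory and the image_name column.

```python
import os
import tempfile


def matching_names(csv_names, directory):
    """Return the names from csv_names that also exist in directory."""
    on_disk = set(os.listdir(directory))  # set membership is O(1) on average
    return [name for name in csv_names if name in on_disk]


# Demo with a throwaway directory (assumed data, for illustration only)
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("a.jpg", "c.jpg"):
        open(os.path.join(tmp, fname), "w").close()

    csv_names = ["a.jpg", "b.jpg", "c.jpg"]  # stand-in for the image_name column
    result = matching_names(csv_names, tmp)
```

The list comprehension keeps the order of the CSV names, which a bare set intersection would not guarantee.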

Upvotes: 1
