Reputation: 136

Python loop through list of csv and check for value?

I have five CSV files with the same fields in the same order. They need to be processed as follows:

Get list of files
Make each file into a dataframe
Check whether a column of letter-number codes contains a specific value (different for each file), e.g. check whether PT333 is in column1 of the file named data1:

column1   column2    column3    
PT389     LA       image.jpg
PT372     NY       image2.jpg

If the column has a specific value, print which value it has and the filename/variable name that I've assigned to that file, and then rename that dataframe to output1

I tried to do this, but I don't know how to make it loop and do the same thing for each file. At the moment it returns the number, but I also want it to return the data frame name, and I also want it to loop through all the files (a to e) to check for all the values in the numbers list.

This is what I have:

import os
import glob
import pandas as pd
from glob import glob
from os.path import expanduser

home = expanduser("~")
os.chdir(home + f'/files/')

data = glob.glob('data*.csv')
data

# If you have tips on how to loop through these rather than 
# have a line for each one, open to feedback
a = pd.read_csv(data[0], encoding='ISO-8859-1', error_bad_lines=False)
b = pd.read_csv(data[1], encoding='ISO-8859-1', error_bad_lines=False)
c = pd.read_csv(data[2], encoding='ISO-8859-1', error_bad_lines=False)
d = pd.read_csv(data[3], encoding='ISO-8859-1', error_bad_lines=False)
e = pd.read_csv(data[4], encoding='ISO-8859-1', error_bad_lines=False)
filenames = [a,b,c,d,e]
filelist= ['a','b','c','d','e']

# I am aware that this part is repetitive. Unsure how to fix this,
# I keep getting errors
# Any help appreciated
numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']
def type():
    for i in a.column1:
        if i == numbers[0]:
            print(numbers[0])
        elif i == numbers[1]:
            print(numbers[1])
        elif i == numbers[2]:
            print(numbers[2])
        elif i == numbers[3]:
            print(numbers[3])
        elif i == numbers[4]:
            print(numbers[4])
type()

Also happy to take any constructive criticism as to how to repeat less code and make things smoother. TIA

Upvotes: 2

Answers (3)

r.ook

Reputation: 13858

Give this a try

for file in glob.glob('data*.csv'):       # loop through each file
    df = pd.read_csv(file,                # create the DataFrame of the file
                     encoding='ISO-8859-1',
                     error_bad_lines=False)  # pandas >= 1.3: on_bad_lines='skip'
    result = (df.where(df.isin(numbers))  # keep only the cells that match one of the numbers
                .melt()['value']          # melt the DF to be a series of 'value'
                .dropna()                 # remove any NaNs (non-matches)
                .unique().tolist())       # return the unique values as a list
    if result:                            # if there are any results
        print(file, ', '.join(result))    # print the file name and the results

The chained call is wrapped in parentheses rather than continued with backslashes, so the inline comments can stay when you copy and paste the code. It assumes glob was imported with import glob; after from glob import glob, call glob('data*.csv') instead.
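To see what each step of that chain does, here is a minimal sketch on a small in-memory DataFrame. The sample rows and the intermediate names mask, kept and matches are made up for illustration; only numbers and the column names come from the question.

```python
import pandas as pd

numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']

# Made-up sample data in the shape described in the question.
df = pd.DataFrame({
    'column1': ['PT389', 'PT333', 'PT372', 'PT333'],
    'column2': ['LA', 'NY', 'NY', 'LA'],
    'column3': ['image.jpg', 'image2.jpg', 'image3.jpg', 'image4.jpg'],
})

mask = df.isin(numbers)          # True only in cells equal to one of the numbers
kept = df.where(mask)            # non-matching cells become NaN
matches = (kept.melt()['value']  # stack all columns into a single 'value' column
               .dropna()         # drop the NaN (non-matching) cells
               .unique()         # keep each matching value once
               .tolist())
print(matches)                   # ['PT333']
```

Because isin is applied to the whole DataFrame, a match in any column counts. If only column1 should be searched, applying isin to df['column1'] alone would be a narrower alternative.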

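As mentioned, you should be able to do the same without a DataFrame as well:

for file in glob.glob('data*.csv'):
    with open(file, encoding='ISO-8859-1') as f:   # glob returns paths, so open each one
        data = f.read()
    for num in numbers:
        if num in data:                            # plain substring search over the whole file
            print(file, num)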
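Note that the plain-text check matches substrings anywhere in the file, so PT333 would also be found inside PT3331 or in another column. A possible variant, sketched here with the standard csv module, compares whole values in column1 only. The helper name find_matches is hypothetical, and the sketch assumes each file is comma-separated with a header row that contains column1.

```python
import csv
import glob

numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']

def find_matches(path, wanted, column='column1'):
    """Return the values from `wanted` that occur in `column` of the CSV at `path`."""
    wanted = set(wanted)
    found = []
    with open(path, newline='', encoding='ISO-8859-1') as f:
        for row in csv.DictReader(f):       # each row is a dict keyed by header name
            value = row.get(column)
            if value in wanted and value not in found:
                found.append(value)         # keep first-seen order, no duplicates
    return found

for file in glob.glob('data*.csv'):
    matches = find_matches(file, numbers)
    if matches:
        print(file, ', '.join(matches))
```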

Upvotes: 1

W-B

Reputation: 1287

Also happy to take any constructive criticism as to how to repeat less code and make things smoother.

I hope you don't mind that I started with a code restructure; it makes explaining the next steps easier.

Loading the Files Array

Using a list comprehension allows us to iterate through the files and load them into a list in one line. It is also usually a little faster than appending to a list in an explicit loop.

files = [pd.read_csv(entry, encoding='ISO-8859-1', error_bad_lines=False) for entry in data]

Type Function

First, the function needs a parameter so that we can call it on any given file. With the DataFrames in a list, we can then loop over them with a for-each loop.

Calling the Type Function on Multiple Files

We use a for-each loop again here:

for file in files:
    type(file)

Result


import os
import glob
import pandas as pd
from glob import glob
from os.path import expanduser

home = expanduser("~")
os.chdir(home + f'/files/')

# Please note that I am using glob() instead of glob.glob() here,
# because `from glob import glob` rebinds the name glob to the function.
data = glob('data*.csv')
files = [pd.read_csv(entry, encoding='ISO-8859-1', error_bad_lines=False) for entry in data]


numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']

def type(file):
    for value in file.column1:
        if value in numbers:
            print(value)

for file in files:
    type(file)
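
The question also asks for the name of the file each match came from. One way to keep that information, sketched here under assumptions, is to store the DataFrames in a dict keyed by file name instead of a plain list. The names frames, sample_csvs and find_values are hypothetical (a different function name also avoids shadowing Python's built-in type), and the in-memory CSV text stands in for the real files.

```python
import io
import pandas as pd

numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']

# Stand-ins for the real files. With files on disk, the dict could be built as
# {name: pd.read_csv(name, encoding='ISO-8859-1') for name in glob('data*.csv')}
sample_csvs = {
    'data1.csv': 'column1,column2,column3\nPT389,LA,image.jpg\nPT333,NY,image2.jpg\n',
    'data2.csv': 'column1,column2,column3\nPT372,NY,image3.jpg\n',
}
frames = {name: pd.read_csv(io.StringIO(text)) for name, text in sample_csvs.items()}

def find_values(frame, wanted):
    """Return the unique values of frame['column1'] that appear in wanted."""
    return frame.loc[frame['column1'].isin(wanted), 'column1'].unique().tolist()

for name, frame in frames.items():
    for value in find_values(frame, numbers):
        print(name, value)       # file name first, then the matching value
```

With the sample data above, only data1.csv produces output, since PT333 is the only listed value present.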

Upvotes: 1

user5138047

Reputation: 39

I would suggest changing the type function and calling it slightly differently:

    def type(x):
        for i in x.column1:
            if i == numbers[0]:
                print(i, numbers[0])
            elif i == numbers[1]:
                print(i, numbers[1])
            elif i == numbers[2]:
                print(i, numbers[2])
            elif i == numbers[3]:
                print(i, numbers[3])
            elif i == numbers[4]:
                print(i, numbers[4])

    for j in filenames:
        type(j)

Upvotes: 0