Pandas Read CSV Error When Reading Multiple Files

Question

I have multiple CSV files, named 2C-BEB-29-2009-01-18.csv, 2C-BEB-29-2010-02-18.csv, 2C-BEB-29-2010-03-28.csv, 2C-ISI-12-2010-01-01.csv, and so on.

The 2C- prefix is the same in all CSV files.
BEB is the name of the recording device.
29 is the user ID.
2009-01-18 is the date of the recording.
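
The naming scheme above can be split mechanically. Here is a minimal sketch, assuming every filename has exactly the form 2C-<device>-<user ID>-<date>.csv. The names Recording and parse_recording_name are hypothetical and do not come from the original code.

```python
from typing import NamedTuple


class Recording(NamedTuple):
    """Fields encoded in a filename (hypothetical helper type)."""
    device: str
    user_id: str
    date: str


def parse_recording_name(filename: str) -> Recording:
    # Drop the extension: '2C-BEB-29-2009-01-18.csv' -> '2C-BEB-29-2009-01-18'
    stem = filename.rsplit('.', 1)[0]
    # Split on the first three dashes only, so the date keeps its own dashes.
    _prefix, device, user_id, date = stem.split('-', 3)
    return Recording(device, user_id, date)
```

Because the date is taken from the last field, it can be extracted without knowing the user ID in advance.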

I have around 150 different IDs, each with recordings from different devices. I have written the approach below for a single user ID and would like to automate it for all user IDs.

I use the following code for a single user, passing pattern='2C-BEB-29-*.csv' as a string. Note that I am in the correct directory.

import glob
import pandas as pd

def pd_read_pattern(pattern):
   files = glob.glob(pattern)

   df = pd.DataFrame()
   for f in files:
       # raw string so the regex separator is not treated as a string escape
       a = pd.read_csv(f, sep=r'\s+|;|,', engine='python')
       #date column should be changed depending on patient id
       a['date'] = f.rsplit('29-',1)[-1].rsplit('.',1)[0]

       #df = df.append(a)
       #df = df[df['hf']!=0]

   return df.reset_index(drop=True)

To apply the above code to all user IDs, I read the CSV filenames in the following way and saved their patterns into a list. To avoid duplicate IDs, I converted the list into a set at the end of this snippet.

import glob
lst=[]
for name in glob.glob('*.csv'):
    if len(name)>15:
        # keep the first three dash-separated fields (e.g. '2C-BEB-29') and add a wildcard
        a = '-'.join(name.split('-',3)[:3])+'-*'
        lst.append(a)
lst = set(lst)
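
The grouping step can also be done in one pass, without building wildcard patterns and globbing a second time. This is only a sketch under the same filename assumption (2C-<device>-<user ID>-<date>.csv). The function name group_by_user is hypothetical.

```python
from collections import defaultdict


def group_by_user(filenames):
    """Map each '2C-<device>-<user ID>' prefix to the list of its files."""
    groups = defaultdict(list)
    for name in filenames:
        parts = name.split('-', 3)
        if len(parts) < 4:
            continue  # skip names that do not follow the naming scheme
        groups['-'.join(parts[:3])].append(name)
    return dict(groups)
```

Calling group_by_user(glob.glob('*.csv')) would then give one list of files per device and user, ready to be read in a loop.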

Now I have the unique ID patterns in this example format: '2C-BEB-29-*.csv'. With the code snippet below, I am trying to read the files for each user ID. However, I get a Unicode decode error on the pd.read_csv line. Could you help me with this issue?

for file in lst:
    #print(type(file))
    files = glob.glob(file)
    #print(files)
    df = pd.DataFrame()
    for f in files:
        csv_file = open(f)
        #print(f, type(f))
        a = pd.read_csv(f,sep='\s+|;|,', engine='python')

        #date column should be changed depending on patient id
        #a['date'] = str(csv_file.name).rsplit(f.split('-',3)[2]+'-',1)[-1].rsplit('.',1)[0]

        #df = df.append(a)
        #df = df[df['hf']!=0]


    #return df.reset_index(drop=True)

dsapprentice · Accepted Answer

First, import chardet, a third-party library that guesses a file's character encoding:

import chardet

Then replace this line in your code:

a = pd.read_csv(f,sep='\s+|;|,', engine='python')

with this:

with open(f, 'rb') as file:
    # detect the encoding from the raw bytes of the file
    encoding = chardet.detect(file.read())["encoding"]
a = pd.read_csv(f, sep=r'\s+|;|,', engine='python', encoding=encoding)
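
If chardet is not installed, a rougher alternative is to try a few candidate encodings in order and keep the first one that decodes the file without error. This is a different technique from chardet's statistical detection, shown only as a standard-library sketch. The function name guess_encoding and the candidate list are assumptions, and latin-1 accepts any byte sequence, so it acts as a last resort rather than a real detection.

```python
def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file."""
    with open(path, "rb") as fh:
        raw = fh.read()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes; try the next
        return enc
    return None  # none of the candidates worked
```

The result can be passed to pd.read_csv through its encoding parameter, in the same way as the chardet result above.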
