Reputation: 12406
I am often vertically concatenating many *.csv files in Pandas. So, everytime I do this, I have to check that all the files I am concatenating have the same number of columns. This became quite cumbersome since I had to figure out a way to ignore the files with more or less columns than what I tell it I need. eg. the first 10 files have 4 columns but then file #11 has 8 columns and file #54 has 7 columns. This means I have to load all files - even the files that have the wrong number of columns. I want to avoid loading those files and then trying to concatenate them vertically - I want to skip them completely.
So, I am trying to write a Unit Test with Pandas that will: a. check the size of all the *.csv files in some folder b. ONLY read in the files that have a pre-determined number of columns c. print a message indicating the naems of the *.csv files have the wrong number of columns
Here is what I have (I am working in the folder C:\Users\Downloads):
import unittest
import pandas as pd
from os import listdir
# Create csv files:
df1 = pd.DataFrame(np.random.rand(10,4), columns = ['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(10,3), columns = ['A', 'B', 'C'])
df1.to_csv('test1.csv')
df1.to_csv('test2.csv')
class Conct(unittest.TestCase):
"""Tests for `primes.py`."""
TEST_INP_DIR = 'C:\Users\Downloads'
fns = listdir(TEST_INP_DIR)
t_fn = fn for fn in fns if fn.endswith(".csv") ]
print t_fn
dfb = pd.DataFrame()
def setUp(self):
for elem in Conct.t_fn:
print elem
fle = pd.read_csv(elem)
try:
pd.concat([Conct.dfb,fle],axis = 0, join='outer', join_axes=None, ignore_index=True, verify_integrity=False)
except IOError:
print 'Error: unable to concatenate a file with %s columns.' % fle.shape[1]
self.err_file = fle
def tearDown(self):
del self.err_fle
if __name__ == '__main__':
unittest.main()
Problem: I am gettingthis output:
['test1.csv', 'test2.csv']
----------------------------------------------------------------------
Ran 0 tests in 0.000s
OK
The first print statement works - it is printing a list of *.csv files, as expected. But, for some reason, the second and third print statements do not work.
Also, the concatenation should not have gone through - the second file has 3 columns but the first one has got 4 columns. The IOerror line does not seem to be printing.
How can I use a Python unittest
to check each of the *.csv files to make sure that they have the same number of columns before concatenation? And how can I print the appropriate error message at the correct time?
Upvotes: 2
Views: 4116
Reputation: 30434
On second thought, instead of chunksize, just read in the first row and count the number of columns, then read and append everything with the correct number of columns. In short:
for f in files:
test = pd.read_csv( f, nrows=1 )
if len( test.columns ) == 4:
df = df.append( pd.read_csv( f ) )
Here's the full version:
df1 = pd.DataFrame(np.random.rand(2,4), columns = ['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(2,3), columns = ['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.rand(2,4), columns = ['A', 'B', 'C', 'D'])
df1.to_csv('test1.csv',index=False)
df2.to_csv('test2.csv',index=False)
df3.to_csv('test3.csv',index=False)
files = ['test1.csv', 'test2.csv', 'test3.csv']
df = pd.DataFrame()
for f in files:
test = pd.read_csv( f, nrows=1 )
if len( test.columns ) == 4:
df = df.append( pd.read_csv( f ) )
In [54]: df
Out [54]:
A B C D
0 0.308734 0.242331 0.318724 0.121974
1 0.707766 0.791090 0.718285 0.209325
0 0.176465 0.299441 0.998842 0.077458
1 0.875115 0.204614 0.951591 0.154492
(Edit to add) Regarding the use of nrows
for the test...
line: The only point of the test line is to read in enough of the CSV so that on the next line we check if it has the right number of columns before reading in. In this test case, reading in the first row is sufficient to figure out if we have 3 or 4 columns, and it's inefficient to read in more than that, although there is no harm in leaving off the nrows=1
besides reduced efficiency.
In other cases (e.g. no header row and varying numbers of columns in the data), you might need to read in the whole CSV. In that case, you'd be better off doing it like this:
for f in files:
test = pd.read_csv( f )
if len( test.columns ) == 4:
df = df.append( test )
The only downside of that way is that you completely read in the datasets with 3 columns that you don't want to keep, but you also don't read in the good datasets twice that way. So that's definitely a better way if you don't want to use nrows
at all. Ultimately, depends on what your actual data looks like as to which way is best for you, of course.
Upvotes: 1