Muldoon
Muldoon

Reputation: 7

Use Python to sort columns and order rows from csv files

I am currently on Python 3.4.1, but I don't have access to any modules like Pandas or Numpy on my work pc.

I originally wrote a VBA program in excel where the original data is on Sheet1, the new data in on Sheet2, and the difference between the two sheets is on Sheet3. My program did the following three things:

  1. Sort the data based upon the values in the first column (they can be integers or alphanumeric).
  2. Sequentially order the horizontal rows so the items in the first columns match each other; if they don't match then an extra blank row is added so the rows align with each other.
  3. Create a new result tab and compare the rows. If anything is different between them then it copies over the entire row information from the original CSV file.

Since it was excruciatingly slow, I decided to try and learn Python. In Python I'm able to compare the data but am now wanting to be able to sort the columns and order the rows.

For example:
Original CSV #1
Column1,Column2,Column3,Column4,Column5
1,b1,c11111,d1,e1
2,b2,c2,d2,e2
5,b5,c5,d5,e5,
25,b25,c25,d2555,e25
7,b7,c7,d7,e7

Original CSV #2
Column1,Column2,Column3,Column4,Column5
2,b2,c2,d2,e2
1,b1,c1,d1,e1
3,b3,c3,d3,e3
7,b7,c7,d7,e777
25,b25,c25,d25,e25

Since the values in Row 2 are the same in both files, that data is not copied into the results for either file.

Results CSV #1
Column1,Column2,Column3,Column4,Column5
1,b1,c11111,d1,e1

5,b5,c5,d5,e5
7,b7,c7,d7,e7
25,b25,c25,d2555,e25

Results CSV #2
Column1,Column2,Column3,Column4,Column5
1, b1,c1,d1,e1
3,b3,c3,d3,e3

7,b7,c7,d7,e777
25,b25,c25,d25,e25

With the code below, I can accomplish step #3.

 strpath = 'C://Users//User//Desktop//compare//'
 strFileNameA = 'File1'
 strFileNameB = 'File2'

 testfile1 = open(strpath + strFileNameA + '.csv', 'r')
 testfile2 = open(strpath + strFileNameB + '.csv', 'r')

 testresult1 = open(strpath + strFileNameA + '-Results' + '.csv', 'w')
 testresult2 = open(strpath + strFileNameB + '-Results' + '.csv', 'w')

 testlist1 = testfile1.readlines()
 testlist2 = testfile2.readlines()

 k=1
 z=0


 for i,j in zip(testlist1,testlist2):
     if k==1:
         testresult1.write(i.rstrip('\n') + ('n'))
     if i!=j:
         testresult1.write(i.rstrip('\n') + ('n'))
         testresult2.write(j.rstrip('\n') + ('n'))
         z = z+1
     k =int(k)
     k = k+1

 if z ==0:
     testresult1.write('Exact match for ' + str(k) + ' rows')
     testresult1.write('Exact match for ' + str(k) + ' rows')

 testfile1.close()
 testfile2.close()                           
 testresult1.close()
 testresult2.close()

Upvotes: 0

Views: 2194

Answers (2)

goCards
goCards

Reputation: 1436

This is an excellent exercise to introduce you to programming in Python. There are many string functions that will make many data processing tasks much simpler. You can check the docs for more string functions https://docs.python.org/3/library/string.html.

Firstly, I suggest using os.path.join() to create path strings. Second, I recommend using the builtin method sorted() to sort the lines of the file. Note that you must be careful when you sort because sorting strings is not the same as sorting integers.

Step 1 uses the builtin sorted function to sort every line by column 1. This is accomplished by passing a lambda function as the key parameter. Since Python uses zero based indexing, the reference x[0] uses the first column. So this particular lambda function simply returns the first column of each row.

Step 2 goes through all the rows of each file. If they both match, then they both get paired together. Otherwise, a row is matched to a blank row.

import os

strpath = '.'
strFileNameA = 'file1'
strFileNameB = 'file2'

testfile1 = open(os.path.join(strpath, '%s.csv'%(strFileNameA)), 'r')
testfile2 = open(os.path.join(strpath, '%s.csv'%(strFileNameB)), 'r')

testlist1 = testfile1.readlines()
testlist1 = [eachLine.rstrip("\n").split(",") for eachLine in testlist1]
testlist2 = testfile2.readlines()
testlist2 = [eachLine.rstrip("\n").split(",") for eachLine in testlist2]

#step 1
testlist1 = sorted(testlist1,key=lambda x: x[0])
testlist2 = sorted(testlist2,key=lambda x: x[0])

#step 2
def look_for_match(i,list1,j,list2):
    if i == len(list1):
        return i,j+1, ([],list2[j])
    elif j == len(list2):
        return i+1,j,(list1[i],[])
    elif list1[i][0] == list2[j][0]:
        return i+1, j+1,(list1[i],list2[j])
    elif list1[i][0] < list2[j][0]:
        return i+1,j,(list1[i],[])
    else:
        return i,j+1, ([],list2[j])

matched_rows = []
i=0
j=0
while i<len(testlist1) or j<len(testlist2):
    i, j, matched_row = look_for_match(i,testlist1,j,testlist2)
    if matched_row[0] == [] or matched_row[1] == []:
        matched_rows.append(matched_row)


for row_file_1, row_file_2 in matched_rows:
    print(row_file_1, row_file_2)

for row_file_1, row_file_2 in matched_rows:
    print(row_file_1, row_file_2)

Upvotes: 1

Lewis Fogden
Lewis Fogden

Reputation: 524

I'd suggest either looking at namedtuple: https://docs.python.org/3/library/collections.html#collections.namedtuple

or sqlite: https://docs.python.org/3/library/sqlite3.html#module-sqlite3

Both are available in 3.4.1.

If these aren't suitable (i.e. they are relatively small model point files), you could use the built in set object to compare the two sets of data, and use set operations to filter:

with open('csv1.csv','r') as csv_file1:
    header1 = next(csv_file1)   #skip header
    set1 = set(line for line in csv_file1)

with open('csv2.csv','r') as csv_file2:
    header2 = next(csv_file2)   #skip header
    set2 = set(line for line in csv_file2)

print((set1 - set2) |(set2 - set1))

Once you've finalised the set, you can convert it to a list, sort it, and write out.

Upvotes: 0

Related Questions