Reputation: 7
I am currently on Python 3.4.1, but I don't have access to any modules like Pandas or Numpy on my work pc.
I originally wrote a VBA program in excel where the original data is on Sheet1, the new data in on Sheet2, and the difference between the two sheets is on Sheet3. My program did the following three things:
Since it was excruciatingly slow, I decided to try and learn Python. In Python I'm able to compare the data but am now wanting to be able to sort the columns and order the rows.
For example:
Original CSV #1
Column1,Column2,Column3,Column4,Column5
1,b1,c11111,d1,e1
2,b2,c2,d2,e2
5,b5,c5,d5,e5,
25,b25,c25,d2555,e25
7,b7,c7,d7,e7
Original CSV #2
Column1,Column2,Column3,Column4,Column5
2,b2,c2,d2,e2
1,b1,c1,d1,e1
3,b3,c3,d3,e3
7,b7,c7,d7,e777
25,b25,c25,d25,e25
Since the values in Row 2 are the same in both files, that data is not copied into the results for either file.
Results CSV #1
Column1,Column2,Column3,Column4,Column5
1,b1,c11111,d1,e1
5,b5,c5,d5,e5
7,b7,c7,d7,e7
25,b25,c25,d2555,e25
Results CSV #2
Column1,Column2,Column3,Column4,Column5
1, b1,c1,d1,e1
3,b3,c3,d3,e3
7,b7,c7,d7,e777
25,b25,c25,d25,e25
With the code below, I can accomplish step #3.
strpath = 'C://Users//User//Desktop//compare//'
strFileNameA = 'File1'
strFileNameB = 'File2'
testfile1 = open(strpath + strFileNameA + '.csv', 'r')
testfile2 = open(strpath + strFileNameB + '.csv', 'r')
testresult1 = open(strpath + strFileNameA + '-Results' + '.csv', 'w')
testresult2 = open(strpath + strFileNameB + '-Results' + '.csv', 'w')
testlist1 = testfile1.readlines()
testlist2 = testfile2.readlines()
k=1
z=0
for i,j in zip(testlist1,testlist2):
if k==1:
testresult1.write(i.rstrip('\n') + ('n'))
if i!=j:
testresult1.write(i.rstrip('\n') + ('n'))
testresult2.write(j.rstrip('\n') + ('n'))
z = z+1
k =int(k)
k = k+1
if z ==0:
testresult1.write('Exact match for ' + str(k) + ' rows')
testresult1.write('Exact match for ' + str(k) + ' rows')
testfile1.close()
testfile2.close()
testresult1.close()
testresult2.close()
Upvotes: 0
Views: 2194
Reputation: 1436
This is an excellent exercise to introduce you to programming in Python. There are many string functions that will make many data processing tasks much simpler. You can check the docs for more string functions https://docs.python.org/3/library/string.html.
Firstly, I suggest using os.path.join() to create path strings. Second, I recommend using the builtin method sorted() to sort the lines of the file. Note that you must be careful when you sort because sorting strings is not the same as sorting integers.
Step 1 uses the builtin sorted function to sort every line by column 1. This is accomplished by passing a lambda function as the key parameter. Since Python uses zero based indexing, the reference x[0] uses the first column. So this particular lambda function simply returns the first column of each row.
Step 2 goes through all the rows of each file. If they both match, then they both get paired together. Otherwise, a row is matched to a blank row.
import os
strpath = '.'
strFileNameA = 'file1'
strFileNameB = 'file2'
testfile1 = open(os.path.join(strpath, '%s.csv'%(strFileNameA)), 'r')
testfile2 = open(os.path.join(strpath, '%s.csv'%(strFileNameB)), 'r')
testlist1 = testfile1.readlines()
testlist1 = [eachLine.rstrip("\n").split(",") for eachLine in testlist1]
testlist2 = testfile2.readlines()
testlist2 = [eachLine.rstrip("\n").split(",") for eachLine in testlist2]
#step 1
testlist1 = sorted(testlist1,key=lambda x: x[0])
testlist2 = sorted(testlist2,key=lambda x: x[0])
#step 2
def look_for_match(i,list1,j,list2):
if i == len(list1):
return i,j+1, ([],list2[j])
elif j == len(list2):
return i+1,j,(list1[i],[])
elif list1[i][0] == list2[j][0]:
return i+1, j+1,(list1[i],list2[j])
elif list1[i][0] < list2[j][0]:
return i+1,j,(list1[i],[])
else:
return i,j+1, ([],list2[j])
matched_rows = []
i=0
j=0
while i<len(testlist1) or j<len(testlist2):
i, j, matched_row = look_for_match(i,testlist1,j,testlist2)
if matched_row[0] == [] or matched_row[1] == []:
matched_rows.append(matched_row)
for row_file_1, row_file_2 in matched_rows:
print(row_file_1, row_file_2)
for row_file_1, row_file_2 in matched_rows:
print(row_file_1, row_file_2)
Upvotes: 1
Reputation: 524
I'd suggest either looking at namedtuple: https://docs.python.org/3/library/collections.html#collections.namedtuple
or sqlite: https://docs.python.org/3/library/sqlite3.html#module-sqlite3
Both are available in 3.4.1.
If these aren't suitable (i.e. they are relatively small model point files), you could use the built in set object to compare the two sets of data, and use set operations to filter:
with open('csv1.csv','r') as csv_file1:
header1 = next(csv_file1) #skip header
set1 = set(line for line in csv_file1)
with open('csv2.csv','r') as csv_file2:
header2 = next(csv_file2) #skip header
set2 = set(line for line in csv_file2)
print((set1 - set2) |(set2 - set1))
Once you've finalised the set, you can convert it to a list, sort it, and write out.
Upvotes: 0