Reputation: 7
I am a beginner and looking for a solution. I am trying to compare columns from two CSV files with no header. The first one has one column and the second one has two.
File_1.csv: #contains 2k rows with random numbers.
1
4
1005
.
.
.
9563
File_2.csv: #Contains 28k rows
0 [81,213,574,697,766,1074,...21622]
1 [0,1,4,10,12,13,1005, ...31042]
2 [35,103,85,1023,...]
3 [4,24,108,76,...]
4 []
.
.
.
28280 [0,1,9,10,32,49,56,...]
I want first to compare the column of File_1 with the first column of File_2 and find out if they match and extract the matching values plus the second column of file2 into a new CSV file (output.csv) deleting the not matching values. For example,
output.csv:
1 [0,1,4,10,12,13,1005, ...31042]
4 []
.
.
.
Second, I want to compare the File_1.csv column (iterate 2k rows) with the second column (each array) of the output.csv and find the matching values and delete the ones that do not, and I want to save those matching values into the output.csv file and also keeping the first column of that file. For example, 4 was deleted as it didn't have any values in the second column (array) as there were no numbers to compare to File_1, but there are others like 1 that did have some that match"
output.csv:
1 [1,4,1005]
.
.
.
I found a code that works for the first step, but it does not save the second column. I have been looking at how to compare arrays, but I haven't been able to. This is what I have so far,
import csv
nodelist = []
node_matches = []
with open('File_1.csv', 'r') as f_rand_node:
csv_f = csv.reader(f_rand_node)
for row in csv_f:
nodelist.append(row[0])
set_node = set(nodelist)
with open('File_2.csv', 'r') as f_tbl:
with open('output.csv', 'w') as f_out:
csv_f = csv.reader(f_tbl)
for row in csv_f:
set_row = set(' '.join(row).split(' '))
if set_row.intersection(set_node):
node_match = list(set_row.intersection(set_node))[0]
f_out.write(node_match + '\n')
Thank you for the help.
Upvotes: 0
Views: 299
Reputation: 451
I'd recommend to use pandas
for this case.
File_1.csv:
1
4
1005
9563
File_2.csv:
0 [81,213,574,697,766,1074]
1 [0,1,4,10,12,13,1005,31042]
2 [35,103,85,1023]
3 [4,24,108,76]
4 []
5 [0,1,9,10,32,49,56]
Code:
import pandas as pd
import csv
file1 = pd.read_csv('File_1.csv', header=None)
file1.columns=['number']
file2 = pd.read_csv('File_2.csv', header=None, delim_whitespace=True, index_col=0)
file2.columns = ['data']
df = file2[file2.index.isin(file1['number'].tolist())] # first step
df = df[df['data'] != '[]'] # second step
df.to_csv('output.csv', header=None, sep='\t', quoting=csv.QUOTE_NONE)
Output.csv:
1 [0,1,4,10,12,13,1005,31042]
Upvotes: 0
Reputation: 503
The entire thing is a lot easier with pandas DataFrames:
import pandas as pd
#Read the files into two dataFrames
df1= pd.read_csv("File_1.csv")
df2= pd.read_csv("File_2.csv")
df2.set_index("Column 0")
df2= df2.filter(items = df1)
index= df1.values()
df2 = df2.applymap(lambda x: set(x).intersection(index))
df.to_csv("output.csv")
This should do the trick, quite simply.
Upvotes: 0