Caroline
Caroline

Reputation: 7

Compare one column (vector) from one CSV file with two columns (vector and array) from another CSV file using Python 3.8

I am a beginner and looking for a solution. I am trying to compare columns from two CSV files with no header. The first one has one column and the second one has two.

File_1.csv: #contains 2k rows with random numbers.

1 
4
1005
.
.
.
9563

File_2.csv: #Contains 28k rows

0      [81,213,574,697,766,1074,...21622]
1      [0,1,4,10,12,13,1005, ...31042]
2      [35,103,85,1023,...]
3      [4,24,108,76,...]
4      []
.
.
.
28280  [0,1,9,10,32,49,56,...]

I want first to compare the column of File_1 with the first column of File_2 and find out if they match and extract the matching values plus the second column of file2 into a new CSV file (output.csv) deleting the not matching values. For example,

output.csv:

1      [0,1,4,10,12,13,1005, ...31042]
4      []
.
.
.

Second, I want to compare the File_1.csv column (iterate 2k rows) with the second column (each array) of the output.csv and find the matching values and delete the ones that do not, and I want to save those matching values into the output.csv file and also keeping the first column of that file. For example, 4 was deleted as it didn't have any values in the second column (array) as there were no numbers to compare to File_1, but there are others like 1 that did have some that match"

output.csv:

1      [1,4,1005]
.
.
.

I found a code that works for the first step, but it does not save the second column. I have been looking at how to compare arrays, but I haven't been able to. This is what I have so far,

import csv
nodelist = []
node_matches = []

with open('File_1.csv', 'r') as f_rand_node:
    csv_f = csv.reader(f_rand_node)

    for row in csv_f:
        nodelist.append(row[0])

set_node = set(nodelist)

with open('File_2.csv', 'r') as f_tbl:

    with open('output.csv', 'w') as f_out:

        csv_f = csv.reader(f_tbl)

        for row in csv_f:

            set_row = set(' '.join(row).split(' ')) 
           
            if set_row.intersection(set_node):
                node_match = list(set_row.intersection(set_node))[0]
            
                f_out.write(node_match + '\n')

Thank you for the help.

Upvotes: 0

Views: 299

Answers (2)

Roman
Roman

Reputation: 451

I'd recommend to use pandas for this case.

File_1.csv:

1
4
1005
9563

File_2.csv:

0      [81,213,574,697,766,1074]
1      [0,1,4,10,12,13,1005,31042]
2      [35,103,85,1023]
3      [4,24,108,76]
4      []
5      [0,1,9,10,32,49,56]

Code:

import pandas as pd
import csv


file1 = pd.read_csv('File_1.csv', header=None)
file1.columns=['number']
file2 = pd.read_csv('File_2.csv', header=None, delim_whitespace=True, index_col=0)
file2.columns = ['data']

df = file2[file2.index.isin(file1['number'].tolist())]  # first step
df = df[df['data'] != '[]']  # second step

df.to_csv('output.csv', header=None, sep='\t', quoting=csv.QUOTE_NONE)

Output.csv:

1   [0,1,4,10,12,13,1005,31042]

Upvotes: 0

supercooler8
supercooler8

Reputation: 503

The entire thing is a lot easier with pandas DataFrames:

import pandas as pd

#Read the files into two dataFrames
df1= pd.read_csv("File_1.csv")
df2= pd.read_csv("File_2.csv")

df2.set_index("Column 0")
df2= df2.filter(items = df1)
index= df1.values()
df2 = df2.applymap(lambda x: set(x).intersection(index))
df.to_csv("output.csv")

This should do the trick, quite simply.

Upvotes: 0

Related Questions