fronthem
fronthem

Reputation: 4139

How to get the intersection line between two .csv file?

How to get the intersection line between two .csv file? My question seem quite difficult to understand but i will give you an example like this:

I have 2 .csv files:

+---+-------+----+      +-----------------+
| A |   B   | C  |      | A  |   B   | C  |
+---+-------+----+      +----+-------+----+
| 1 | ant   | 14 |*     | 6  | Fan   | 12 |
| 2 | bird  | 11 |      | 7  | gun   | 55 |*
| 3 | cat   | 21 |*     | 8  | horse | 21 |*
| 4 | dog   | 55 |*     | 9  | ice   | 15 |
| 5 | egg   | 99 |      | 10 | jar   | 14 |*
+---+-------+---+       +----+-------+----+
     Table 1                  Table 2

So, if i use Table 1 to filter Table 2 i will get the output like this:

+----------------+
| A |   B   |  C |
+---+-------+----+     
| 7 | gun   | 55 |*     
| 8 | horse | 21 |*    
| 10| jar   | 14 |*       
+---+-------+----+     
    Table 3                   

Yes, i use the last column of Table 1 to Filter on Table 2

How can i filter it like this with any tool?

Upvotes: 0

Views: 1061

Answers (4)

NoChance
NoChance

Reputation: 5752

As per the comments above, here is what I did:

Create table T1 (A INT, B VARCHAR(100), C INT);

Create table T2 (A INT, B VARCHAR(100), C INT);

Insert into T1 Values (1, 'ant',14);
Insert into T1 Values (2, 'bird',11);
Insert into T1 Values (3, 'cat',21);
Insert into T1 Values (4, 'dog',55);
Insert into T1 Values (5, 'egg',99);

Insert into T2 Values (6, 'fan',12);
Insert into T2 Values (7, 'gun',55);
Insert into T2 Values (8, 'horse',21);
Insert into T2 Values (9, 'ice',15);
Insert into T2 Values (10, 'jar',14);

I am not sure if you have the data already in the table or not, there are tools to import csv files into your db.

If you are going to run the query often, then building an index on column A of each table would speed up the process. I did not build any indexes for this simple case.

The select required to get your result is:

select * from t2,t1 where t2.c = t1.c order by t2.a

If you are happy with the results you can put it in a table like this (SQL Server)

SELECT T2.A, T2.B , T2.C INTO TEST FROM t2,t1 where t2.c = t1.c order by t2.a

I hope this is what you want...

Upvotes: 1

Rubens
Rubens

Reputation: 14768

Assuming your input files do only contain the numbers and strings, without the additional table-formatting characters (just a \t separated file), you may do the following:

Table 1 (t1.txt):

1       ant     14
2       bird    11
3       cat     21
4       dog     55
5       egg     99

Table 2 (t2.txt):

6       Fan     12
7       gun     55
8       horse   21
9       ice     15
10      jar     14

And the execution and output:

$ join -j 3 -o '2.1,2.2,2.3' <(sort -k 3n t1.txt) <(sort -k 3n t2.txt) | sort -n
7 gun 55
8 horse 21
10 jar 14

Alternatively, if you have many columns, you may first extract only the filtering column, and later perform the selection itself. Also, you may output the join-input files without specifying an initial pattern (all the columns, in your case). After the execution you simply select the columns you actually want to display (using cut, for exampĺe):

$ join -2 15 <(cut -f 15 t1.txt | sort) <(sort -k 15 t2.txt) | sort -n

Notice cut uses \t as the default separator -- which can be redefined with flag -d <separator>. Anyway, as pointed by @EmmadKareem, you may be better off using a proper DB for this task -- as it will have dozens of optimizations, surely required by your hundred-thousand/million lines.

Upvotes: 1

potong
potong

Reputation: 58400

This might work for you (GNU sed):

sed -r 's/(\S+\s?){3}/\/(^\\S+\\s){2}\1$\/p/' file1.csv | sed -nrf - file2.csv

for a file that is space or tab separated.

For a file that is comma separated:

sed -r 's/([^,]+,?){3}/\/(^[^,]+,){2}\1$\/p/' file1.csv | sed -nrf - file2.csv

This works by creating a sed script from the first table and then uses it to filter against the second table.

Upvotes: 1

edi_allen
edi_allen

Reputation: 1872

Here is a script in Perl to do the job you want.

It works by scanning the first file and keeping the values of the third column in memory. Then it scans the second file and for every line read it compares the value of the third column with those in memory, if there is a match, it prints the line.

#!/usr/bin/perl
use warnings;
use strict;
use 5.010;

my %seen;

open my $file1_fh, '<', 'file1.txt' 
    or die "Can't open file1.txt $!";

while (<$file1_fh>) {
    chomp;
    $seen{ (split)[2] } = 1; #assumes line are delimited by whitespace.
}

close $file1_fh;

open my $file2_fh, '<', 'file2.txt'
    or die "Can't open file2.txt $!";


while (<$file2_fh>) {
    chomp;
    my $third_column_value = (split)[2]; #assumes line are delimited by whitespace.
    say if $seen{ $third_column_value };
}

close $file2_fh;

__END__

#OUTPUT
7 gun 55
8 horse 21
10 jar 14

Upvotes: 0

Related Questions