Reputation: 615
I have 2 files. The first file contains a list of row IDs of tuples of a table in the database, and the second file contains SQL queries with these row IDs in the "where" clause of the query.
For example:
File 1
1610657303
1610658464
1610659169
1610668135
1610668350
1610670407
1610671066
File 2
update TABLE_X set ATTRIBUTE_A=87 where ri=1610668350;
update TABLE_X set ATTRIBUTE_A=87 where ri=1610672154;
update TABLE_X set ATTRIBUTE_A=87 where ri=1610668135;
update TABLE_X set ATTRIBUTE_A=87 where ri=1610672153;
I have to read File 1, search File 2 for all the SQL commands that match the row IDs from File 1, and dump those SQL queries into a third file.
File 1 has 100,000 entries and File 2 contains 10 times as many entries, i.e. 1,000,000.
I used
grep -f File_1 File_2 > File_3
but this is extremely slow: the rate is about 1000 entries per hour.
Is there any faster way to do this?
Upvotes: 41
Views: 90250
Reputation: 22402
Most of the previous answers are correct, but the only thing that worked for me was this command:
grep -oi -f a.txt b.txt
Upvotes: 1
Reputation:
## reports any lines contained in <file 1> that are missing from <file 2>
IFS=$(echo -en "\n\b") && for a in $(cat <file 1>);
do ((\!$(grep -F -c -- "$a" <file 2>))) && echo $a;
done && unset IFS
or, to do what the asker wants, drop the negation and redirect:
(IFS=$(echo -en "\n\b") && for a in $(cat <file 1>);
do (($(grep -F -c -- "$a" <file 2>))) && echo $a;
done && unset IFS) >> <file 3>
Upvotes: -1
Reputation: 77185
One way with awk:
awk -v FS="[ =]" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)' File1 File2
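For readability, here is the same program spread over multiple lines with comments (File3 is just a hypothetical name for the output file):
awk -v FS="[ =]" '
    NR == FNR { rows[$1]++; next }            # first file: remember every row ID
    (substr($NF, 1, length($NF)-1) in rows)   # second file: strip the trailing ";" and print the line if its ID is known
' File1 File2 > File3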
This should be pretty quick. On my machine, it took under 2 seconds to create a lookup of 1 million entries and compare it against 3 million lines.
Machine Specs:
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (8 cores)
98 GB RAM
Upvotes: 18
Reputation: 3144
The awk/grep solutions mentioned above were slow or memory-hungry on my machine (file1: 10^6 rows, file2: 10^7 rows), so I came up with an SQL solution using sqlite3.
Turn file2 into a CSV-formatted file where the first field is the value after ri=
cat file2.txt | gawk -F= '{ print $3","$0 }' | sed 's/;,/,/' > file2_with_ids.txt
Create two tables:
sqlite> CREATE TABLE file1(rowId char(10));
sqlite> CREATE TABLE file2(rowId char(10), statement varchar(200));
Import the row IDs from file1:
sqlite> .import file1.txt file1
Import the statements from file2, using the "prepared" version:
sqlite> .separator ,
sqlite> .import file2_with_ids.txt file2
Select all and only the statements in table file2 with a matching rowId in table file1:
sqlite> SELECT statement FROM file2 WHERE file2.rowId IN (SELECT file1.rowId FROM file1);
File 3 can be easily created by redirecting output to a file before issuing the select statement:
sqlite> .output file3.txt
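Put together, the whole export can also be run non-interactively from the shell; this is just a sketch, assuming the two tables above live in a database file named rows.db (a hypothetical name):
sqlite3 rows.db <<'EOF'
.output file3.txt
SELECT statement FROM file2 WHERE file2.rowId IN (SELECT file1.rowId FROM file1);
.output stdout
EOF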
Test data:
sqlite> select count(*) from file1;
1000000
sqlite> select count(*) from file2;
10000000
sqlite> select * from file1 limit 4;
1610666927
1610661782
1610659837
1610664855
sqlite> select * from file2 limit 4;
1610665680|update TABLE_X set ATTRIBUTE_A=87 where ri=1610665680;
1610661907|update TABLE_X set ATTRIBUTE_A=87 where ri=1610661907;
1610659801|update TABLE_X set ATTRIBUTE_A=87 where ri=1610659801;
1610670610|update TABLE_X set ATTRIBUTE_A=87 where ri=1610670610;
Without creating any indices, the select statement took about 15 seconds on an AMD A8 1.8 GHz 64-bit Ubuntu 12.04 machine.
Upvotes: 0
Reputation: 15954
I suggest using a programming language such as Perl, Ruby or Python.
In Ruby, a solution reading both files (f1 and f2) just once could be:
idxes = File.readlines('f1').map(&:chomp)
File.foreach('f2') do |line|
  next unless line =~ /where ri=(\d+);$/
  puts line if idxes.include? $1
end
or with Perl
open $file, '<', 'f1';
while (<$file>) { chomp; $idxs{$_} = 1; }
close($file);
open $file, '<', 'f2';
while (<$file>) {
    next unless $_ =~ /where ri=(\d+);$/;
    print $_ if $idxs{$1};
}
close $file;
Upvotes: 1
Reputation: 94549
I may be missing something, but wouldn't it be sufficient to just iterate over the IDs in file1 and, for each ID, grep file2 and store the matches in a third file? I.e.
for ID in `cat file1`; do grep $ID file2; done > file3
This is not terribly efficient (since file2 will be read over and over again), but it may be good enough for you. If you want more speed, I'd suggest using a more powerful scripting language which lets you read file2 into a map that quickly allows identifying lines for a given ID.
Here's a Python version of this idea:
queryByID = {}
for line in file('file2'):
    lastEquals = line.rfind('=')
    semicolon = line.find(';', lastEquals)
    id = line[lastEquals + 1:semicolon]
    queryByID[id] = line.rstrip()

for line in file('file1'):
    id = line.rstrip()
    if id in queryByID:
        print queryByID[id]
Upvotes: 0
Reputation: 653
Maybe try AWK and use the numbers from file 1 as keys, with a simple two-step script.
First, one awk script produces a second awk script from file 1:
awk -f script1.awk file1
where script1.awk contains:
{ print "$0 ~ ", $0, "{ print $0 }" > "script2.awk"; }
and then invoke script2.awk on file 2 (see the sketch below).
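A complete sketch of that two-step idea, keeping the file names from the question as placeholders (file1, file2, file3) and writing the generated rules via a shell redirect instead of a separate script1.awk:
# step 1: turn every row ID in file1 into an awk rule of the form  $0 ~ <id> { print $0 }
awk '{ print "$0 ~ " $0 " { print $0 }" }' file1 > script2.awk
# step 2: run the generated rules against file2 and keep the matching statements
awk -f script2.awk file2 > file3
Note that each generated rule is a separate regex test evaluated for every line of file2, so for the sizes in the question this will still be slow.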
Upvotes: -1