Reputation: 1315
I have two text files. file1 contains a list of IDs:
11002
10995
48981
79600
file2:
10993 item 0
11002 item 6
10995 item 7
79600 item 7
439481 item 5
272557 item 7
224325 item 7
84156 item 6
572546 item 7
693661 item 7
.....
I am trying to select all lines from file2 where the ID (first column) appears in file1. Currently, I loop through the first file to build a regex like:
^\b11002\b\|^\b10995\b\|^\b48981\b\|^\b79600\b
Then run:
grep '^11002\|^10995\|^48981\|^79600' file2.txt
But when the number of IDs in file1 is large (~2000), the regular expression becomes very long and grep becomes slow. Is there another way? I am using Perl + Awk + Unix.
Upvotes: 4
Views: 9936
Reputation: 67910
A simple Perl solution is to use a hash and count the occurrences of the sought-after numbers:
perl -lanwe 'print if $a{$F[0]}++ == 1;' file1.txt file2.txt
I get the following output from your sample data:
11002 item 6
10995 item 7
79600 item 7
Note that you need to use the files in the correct order on the command line.
This reads the input files line by line (-n), autosplits each line (-a) into @F, and prints the line if the hash count for that number equals 1. If you want to print repeated matches from file2, simply change == 1 to >= 1.
Note that post-increment ++ returns the old value, so the increment takes effect only after the equality comparison is done.
Upvotes: 0
Reputation: 96976
Use a hash table. It can be memory-intensive, but lookups run in constant time. This is one efficient and correct approach (not the only one): build a hash table keyed on the IDs in file1, then look up the first column of each file2 line in it. If the key is in the hash table, the line is printed to standard output:
#!/usr/bin/env perl
use strict;
use warnings;

# build a lookup hash keyed on the IDs in file1
open my $fh1, '<', 'file1' or die "could not open file1: $!\n";
my %keys;
while (<$fh1>) {
    chomp;
    $keys{$_} = 1;
}
close $fh1;

# print each file2 line whose first column is a known key
# (split ' ' handles both spaces and tabs between columns)
open my $fh2, '<', 'file2' or die "could not open file2: $!\n";
while (<$fh2>) {
    chomp;
    my ($testKey) = split ' ', $_;
    print STDOUT "$_\n" if defined $keys{$testKey};
}
close $fh2;
There are lots of ways to do the same thing in Perl. That said, I value clarity and explicitness over fancy obscurity, because you never know when you have to come back to a Perl script and make changes, and they are hard enough to manage, as it is. One person's opinion.
Upvotes: 6
Reputation: 247042
Use process substitution to transform the IDs in file1 into anchored regular expressions:
grep -f <(sed 's/.*/^&\\b/' file1) file2
I'm assuming you're using bash or a similarly capable shell.
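To see exactly what grep receives, you can run the sed transformation on its own (two sample IDs from the question shown here; note that \b in the resulting patterns is a GNU grep extension):

```shell
# Each ID becomes an anchored, word-bounded pattern for grep -f.
printf '11002\n10995\n' | sed 's/.*/^&\\b/'
# prints:
#   ^11002\b
#   ^10995\b
```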
Upvotes: 0
Reputation: 85865
Using grep
:
$ grep -f file1 file2
11002 item 6
10995 item 7
79600 item 7
Note: I tested several of the suggested answers on multiple systems, and some only display the last match, 79600 item 7.
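A separate pitfall worth guarding against: plain grep -f treats each ID as an unanchored pattern, so an ID can also match inside a longer number. A small sketch (hypothetical files p and d) showing how -F (fixed strings) and -w (whole words) tighten the match:

```shell
printf '11002\n' > p                        # patterns: one ID
printf '11002 item 6\n211002 item 9\n' > d  # data: the ID, plus a superstring of it
grep -f  p d   # matches both lines: 11002 also occurs inside 211002
grep -wFf p d  # -F fixed strings, -w whole words: only the first line
```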
Upvotes: 2
Reputation: 204184
awk 'NR==FNR{tgts[$1]; next} $1 in tgts' file1 file2
Look:
$ cat file1
11002
10995
48981
79600
$ cat file2
10993 item 0
11002 item 6
10995 item 7
79600 item 7
439481 item 5
272557 item 7
224325 item 7
84156 item 6
572546 item 7
693661 item 7
$ awk 'NR==FNR{tgts[$1]; next} $1 in tgts' file1 file2
11002 item 6
10995 item 7
79600 item 7
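For readers new to the idiom: FNR is the line number within the current file, while NR counts lines across all files, so NR==FNR is true only while the first file is being read. The same command, spread out with comments:

```shell
awk '
  NR == FNR { tgts[$1]; next }   # first file: store each ID as an array key
  $1 in tgts                     # second file: default action prints matching lines
' file1 file2
```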
Upvotes: 4
Reputation: 1572
I would suggest using a tool designed to do just that: the join command. See man join for more info.
linux_prompt> join file1 file2
11002 item 6
10995 item 7
79600 item 7
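One caveat: join expects both inputs to be sorted on the join field, and the sample file1 is not, so sort first (join's default comparison is lexical, which plain sort provides):

```shell
# join requires its inputs sorted on the join field (the first column here)
sort file1 > file1.sorted
sort file2 > file2.sorted
join file1.sorted file2.sorted
```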
Upvotes: 3
Reputation: 3225
Load all the elements of your first file into a hash. Then, for each line of the second file:
- extract the number using the regex ^(\d*)
- if the hash contains the extracted number, print the line
Upvotes: 1