jjennifer

Reputation: 1315

extract all lines from text file based on a given list of IDs

I have 2 text files. file1 contains a list of IDs:

11002
10995
48981
79600

file2:

10993   item    0
11002   item    6
10995   item    7
79600   item    7
439481  item    5
272557  item    7
224325  item    7
84156   item    6
572546  item    7
693661  item    7
.....

I am trying to select all lines from file2 whose ID (first column) appears in file1. Currently, I loop through the first file to build a regex like:

^\b11002\b\|^\b10995\b\|^\b48981\b\|^\b79600\b

Then run:

grep '^11002\|^10995\|^48981\|^79600' file2.txt

But when the number of IDs in file1 is too large (~2000), the regular expression becomes quite long and grep becomes slow. Is there another way? I am using Perl + Awk + Unix.

Upvotes: 4

Views: 9936

Answers (7)

TLP

Reputation: 67910

A simple Perl solution is to use a hash and count the occurrences of the sought-after numbers.

perl -lanwe 'print if $a{$F[0]}++ == 1;' file1.txt file2.txt

I get the following output from your sample data:

11002   item    6
10995   item    7
79600   item    7

Note that you need to use the files in the correct order on the command line.

This reads the input files line by line (-n), autosplits each line (-a) into @F, and prints a line only if the hash value for its first field is equal to 1, that is, if the number has already been seen exactly once. If you want to print multiple values from file2, simply change == 1 to >= 1.

Note that the postfix ++ operator returns the old value, so the equality comparison sees the count before the increment.

Upvotes: 0

Alex Reynolds

Reputation: 96976

Use a hash table. It can be memory-intensive, but lookups take constant time on average. The following is one efficient and correct procedure (not the only one): it builds a hash table with the lines of file1 as keys, then looks up the first column of each file2 line in that table. If a key is in the hash table, the line is printed to standard output:

#!/usr/bin/env perl

use strict;
use warnings;

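# Load every ID from file1 as a hash key.
open my $file1, '<', 'file1' or die "could not open file1: $!\n";
my $keyRef;
while (<$file1>) {
    chomp;
    $keyRef->{$_} = 1;
}
close $file1;

# Print each file2 line whose first column is a known ID.
# Assumes the columns are tab-separated.
open my $file2, '<', 'file2' or die "could not open file2: $!\n";
while (<$file2>) {
    chomp;
    my ($testKey, $label, $count) = split("\t", $_);
    if (defined $testKey && exists $keyRef->{$testKey}) {
        print STDOUT "$_\n";
    }
}
close $file2;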

There are lots of ways to do the same thing in Perl. That said, I value clarity and explicitness over fancy obscurity, because you never know when you will have to come back to a Perl script and make changes, and such scripts are hard enough to maintain as it is. One person's opinion.

Upvotes: 6

glenn jackman

Reputation: 247042

Use a process substitution to transform the IDs in file1 into regular expressions:

grep -f <(sed 's/.*/^&\\b/' file1) file2

I'm assuming you're using bash or a similarly capable shell.
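For shells without process substitution, the generated patterns can be written to a file first. A minimal sketch, assuming GNU grep (the \b word boundary is a GNU extension); the scratch directory, the patterns file name, and the extra ID 110025 are illustrative and not part of the question:

```shell
#!/bin/sh
# Recreate a small sample in a scratch directory (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' 11002 10995 48981 79600 > "$tmp/file1"
printf '%s   item    %s\n' 10993 0 11002 6 10995 7 79600 7 110025 3 > "$tmp/file2"

# Turn each ID into an anchored pattern such as ^11002\b
sed 's/.*/^&\\b/' "$tmp/file1" > "$tmp/patterns"
first_pattern=$(sed -n 1p "$tmp/patterns")

# The trailing \b keeps the pattern for 11002 from matching 110025.
result=$(grep -f "$tmp/patterns" "$tmp/file2")

printf '%s\n' "$first_pattern" "$result"
rm -rf "$tmp"
```

Anchoring with ^ restricts the match to the first column, and the word boundary stops an ID from matching as a prefix of a longer one.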

Upvotes: 0

Chris Seymour

Reputation: 85865

Using grep:

$ grep -f f1 f2
11002   item    6
10995   item    7
79600   item    7

Note: I tested many of the suggested answers on multiple systems, and some displayed only the last match, 79600 item 7!?

Upvotes: 2

Ed Morton

Reputation: 204184

awk 'NR==FNR{tgts[$1]; next} $1 in tgts' file1 file2

Look:

$ cat file1
11002
10995
48981
79600
$ cat file2
10993   item    0
11002   item    6
10995   item    7
79600   item    7
439481  item    5
272557  item    7
224325  item    7
84156   item    6
572546  item    7
693661  item    7
$ awk 'NR==FNR{tgts[$1]; next} $1 in tgts' file1 file2
11002   item    6
10995   item    7
79600   item    7
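
The answer does not explain the one-liner, so here is the same command spelled out with comments. This is only a sketch: the scratch directory and the shortened file2 are illustrative, not the asker's real files.

```shell
#!/bin/sh
# Recreate a small sample in a scratch directory (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' 11002 10995 48981 79600 > "$tmp/file1"
printf '%s   item    %s\n' 10993 0 11002 6 10995 7 79600 7 439481 5 > "$tmp/file2"

result=$(awk '
    # NR counts lines across all files, FNR restarts for each file,
    # so NR == FNR holds only while the first file is being read.
    NR == FNR { tgts[$1]; next }   # referencing tgts[$1] creates the key
    # For file2: a pattern with no action prints the line when it is true.
    $1 in tgts
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the lookup is an exact comparison on the first field, an ID never matches as a substring of a longer one, and the output keeps the order of file2.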

Upvotes: 4

Arun Taylor

Reputation: 1572

I would suggest using a tool designed to do just that: the join command. See 'man join' for more information.

linux_prompt> join file1 file2
11002 item 6
10995 item 7
79600 item 7
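
One caveat: join expects both inputs to be sorted on the join field, and the sample files are not. A sketch that sorts them first, assuming the output order does not need to follow file2; the scratch directory and the .sorted file names are illustrative:

```shell
#!/bin/sh
# Recreate a small sample in a scratch directory (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' 11002 10995 48981 79600 > "$tmp/file1"
printf '%s   item    %s\n' 10993 0 11002 6 10995 7 79600 7 439481 5 > "$tmp/file2"

# Sort both inputs on the first field, using the same locale as join.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# join pairs lines with equal first fields and prints them
# with single spaces between the fields.
result=$(LC_ALL=C join "$tmp/file1.sorted" "$tmp/file2.sorted")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The matching lines come out in sorted (lexical) order of the ID rather than in the original file2 order.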

Upvotes: 3

Mel Nicholson

Reputation: 3225

Load all the elements of your first file into a hash. For each line of the second file, extract the number using the regex ^(\d*). If the hash contains the extracted number, print the line.

Upvotes: 1
