zara

Reputation: 1108

How to extract common lines across multiple files?

I have 15 different files, and I want to create a new file that includes only the lines common to all of them. For example:

File1:

id1
id2
id3

file2:

id2
id3
id4

file3:

id10
id2
id3

file4:

id100
id45
id3
id2

I need the output to be like:

newfile:

id2
id3

I know this command works for each pair of files:

grep -w -f file1 file2 > output

but I need a command that works for more than 2 files.

Any suggestions, please?

Upvotes: 3

Views: 388

Answers (4)

Sundeep

Reputation: 23697

The zet command provides set operations on input files. Use the intersect subcommand to get the lines common to all of the input files. The input doesn't have to be sorted, and the output order is the same as the order of the input lines.

$ zet intersect file1 file2 file3 file4
id2
id3

Here are some relevant details from the notes section:

  • Each output line occurs only once, because we're treating the files as sets and the lines as their elements.
  • Zet reads entire files into memory. Its memory usage is roughly proportional to the file size of its largest argument plus the size of the (eventual) output.

Upvotes: 0

choroba

Reputation: 242373

Perl to the rescue:

perl -lne 'BEGIN { $count = @ARGV }
           $h{$_}{$ARGV} = 1;
           }{
           print $_ for grep $count == keys %{ $h{$_} }, keys %h
           ' file* > newfile
  • -n reads the input files line by line
  • -l strips the newline from each input line and adds one back to each print
  • the @ARGV array contains the input file names, assigning it to $count at the BEGINning just counts them
  • $ARGV contains the name of the current input file
  • $_ contains the current line read from the file.
  • the %h hash contains ids as keys, each key contains a hash reference with file names that contained the id as keys
  • }{ is the "Eskimo greeting" operator, it introduces code that runs once the input is exhausted
  • we only output the ids that were seen in as many files as there are input files. This works for any number of files.

Upvotes: 6

John1024

Reputation: 113994

Using grep

The same trick can be used more than once:

$ grep -w -f file1 file2 | grep -w -f file3 | grep -w -f file4
id2
id3

By the way, if you are looking for exact matches rather than regular-expression matches, it is better and faster to use the -F flag:

$ grep -wFf file1 file2 | grep -wFf file3 | grep -wFf file4
id2
id3
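With 15 files the pipeline above gets long; the same chaining can be done in a loop. A sketch using the question's four sample files (recreated here) — for the real data, extend the list to file15:

```shell
cd "$(mktemp -d)"                      # scratch copy of the question's files
printf '%s\n' id1 id2 id3        > file1
printf '%s\n' id2 id3 id4        > file2
printf '%s\n' id10 id2 id3       > file3
printf '%s\n' id100 id45 id3 id2 > file4

# Seed the running intersection with file1, then filter it through each
# remaining file; -w matches whole words, -F treats patterns as fixed strings.
cp file1 newfile
for f in file2 file3 file4; do         # for 15 files: file2 ... file15
    grep -wFf "$f" newfile > newfile.tmp
    mv newfile.tmp newfile
done
cat newfile
# id2
# id3
```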

Using awk

$ awk 'FNR==1{nfiles++; delete fseen} !($0 in fseen){fseen[$0]++; seen[$0]++} END{for (key in seen) if (seen[key]==nfiles) print key}' file1 file2 file3 file4
id3
id2
  • FNR==1{nfiles++; delete fseen}

    Every time we start reading a new file, we do two things: (1) increment the file counter, nfiles, and (2) clear the array fseen.

  • !($0 in fseen){fseen[$0]++; seen[$0]++}

    If the current line is not a key in fseen, then add it to fseen and increment the count for this line in seen.

  • END{for (key in seen) if (seen[key]==nfiles) print key}

    After we have read the last line of the last file, we look at every key in seen. If the count for that key is equal to the number of files that we have read, nfiles, then we print that key.
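One note that is mine, not the answer's: `delete fseen` without a subscript is an extension (gawk and most modern awks accept it); the traditionally portable way to clear an array is `split("", fseen)`. The same program with that idiom, run on the question's sample files and piped through sort only because for-in iteration order is unspecified:

```shell
cd "$(mktemp -d)"                      # scratch copy of the question's files
printf '%s\n' id1 id2 id3        > file1
printf '%s\n' id2 id3 id4        > file2
printf '%s\n' id10 id2 id3       > file3
printf '%s\n' id100 id45 id3 id2 > file4

# Identical logic to the answer; split("", fseen) empties the
# per-file array portably.
awk 'FNR==1{nfiles++; split("", fseen)}
     !($0 in fseen){fseen[$0]++; seen[$0]++}
     END{for (key in seen) if (seen[key]==nfiles) print key}' \
    file1 file2 file3 file4 | sort
# id2
# id3
```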

Upvotes: 4

P....

Reputation: 18411

$ grep -hxf file1 file2 file3 file4 | sort -u
id2
id3

# To store the result in a file:
$ grep -hxf file1 file2 file3 file4 | sort -u > output.txt
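One caveat (my note, not part of the original answer): this prints every line of file2..file4 that also occurs as a whole line in file1, so a line shared by file1 and just one other file would slip through; it happens to give the right answer for the sample data. For a strict all-files intersection, one portable sketch is to de-duplicate each file and count how many files each line appears in:

```shell
cd "$(mktemp -d)"                      # scratch copy of the question's files
printf '%s\n' id1 id2 id3        > file1
printf '%s\n' id2 id3 id4        > file2
printf '%s\n' id10 id2 id3       > file3
printf '%s\n' id100 id45 id3 id2 > file4

# De-duplicate each file, pool the lines, and keep those that were
# seen once per file (change 4 to 15 for fifteen files).
for f in file1 file2 file3 file4; do sort -u "$f"; done |
    sort | uniq -c | awk '$1 == 4 {print $2}'
# id2
# id3
```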

Upvotes: 1
