user3845185

Reputation: 125

Compare two files content

I have two files, test1.txt and test2.txt.

test1.txt contains

abc.cde.ccd.eed.12345.5678.txt
abcd.cdde.ccdd.eaed.12346.5688.txt
aabc.cade.cacd.eaed.13345.5078.txt
abzc.cdae.ccda.eaed.29345.1678.txt
abac.cdae.cacd.eead.18145.2678.txt
aabc.cdve.cncd.ened.19945.2345.txt

and test2.txt contains

12345.5678.txt
29345.1678.txt
18145.2678.txt
10111.2222.txt

I want to compare these two files in bash and get output like this:

In both:

abc.cde.ccd.eed.12345.5678.txt
abzc.cdae.ccda.eaed.29345.1678.txt
abac.cdae.cacd.eead.18145.2678.txt

Only in test1.txt

abcd.cdde.ccdd.eaed.12346.5688.txt
aabc.cade.cacd.eaed.13345.5078.txt
aabc.cdve.cncd.ened.19945.2345.txt

Only in test2.txt

10111.2222.txt

Upvotes: 2

Views: 357

Answers (4)

Marcus Rickert

Reputation: 4238

The following AWK script script.awk also does the job:

# First file: remember every line in input order.
NR == FNR { lines[++i] = $0 }

# Second file: each line is a substring to look for.
NR > FNR { patterns[++j] = $0 }

END {
    # Mark every line/pattern pair where the pattern occurs in the line.
    # Numeric loops keep the input order; "for (k in array)" has no defined order.
    for (p = 1; p <= j; p++)
        for (l = 1; l <= i; l++)
            if (index(lines[l], patterns[p]) > 0) {
                lines_match[l] = 1
                patterns_match[p] = 1
            }

    print "Lines only in first file:"
    for (l = 1; l <= i; l++)
        if (!(l in lines_match))
            print lines[l]

    print "Lines only in second file:"
    for (p = 1; p <= j; p++)
        if (!(p in patterns_match))
            print patterns[p]

    print "Lines in both files:"
    for (l = 1; l <= i; l++)
        if (l in lines_match)
            print lines[l]
}

It can be called as follows:

awk -f script.awk test1.txt test2.txt

Note that the script does not make any assumptions about the structure of the data in the two files. It simply assumes that the lines in test2.txt are potential substrings of the lines in test1.txt.
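One consequence of plain substring matching is that a shorter entry such as 45.5678.txt would also match abc.cde.ccd.eed.12345.5678.txt. If that is not wanted, a stricter check is possible. The following is only a minimal sketch, assuming an entry from test2.txt should count as a match only when it is the whole line or the tail of the line after a dot; the function name ends_with_suffix is made up for illustration.

```shell
# Hypothetical helper (not part of the answer above).
# $1 = entry from test2.txt, $2 = line from test1.txt
# Succeeds only if the line equals the entry or ends with ".<entry>".
ends_with_suffix() {
    case "$2" in
        # The quoted "$1" is matched literally, so dots are not wildcards.
        "$1" | *."$1") return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   ends_with_suffix 12345.5678.txt abc.cde.ccd.eed.12345.5678.txt && echo match
```

The check uses only POSIX shell pattern matching, so it should not depend on bash-specific features.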

Upvotes: 0

Rasim

Reputation: 1296

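This problem can be solved with comm from GNU Coreutils.

First, sort the second file in place:

sort -o test2.txt test2.txt

Then compare it with the trailing fields of test1.txt. Note that the output consists of those trailing fields, not the full lines of test1.txt: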

# unique to test1.txt
cut -d '.' -f 1-4 --complement test1.txt | sort | comm -23 - test2.txt
# unique to test2.txt
cut -d '.' -f 1-4 --complement test1.txt | sort | comm -13 - test2.txt
# that appear in both files
cut -d '.' -f 1-4 --complement test1.txt | sort | comm -12 - test2.txt

Explanation:

# 1. Extract all but first four fields from test1.txt
cut -d '.' -f 1-4 --complement test1.txt
# 2. '-' makes comm read its first file from standard input
comm -3 - test2.txt
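The commands above rewrite test2.txt in place and rely on the GNU-only --complement option. As a rough sketch of an alternative, assuming the same five-or-more-field layout, the function below leaves both input files untouched and uses cut -f 5-, which selects the same fields as -f 1-4 --complement. The name suffix_comm is hypothetical, and mktemp is assumed to be available.

```shell
# Hypothetical wrapper (an illustration, not the answer's original commands).
# $1 = comm option (-12, -13 or -23), $2 = file with full names, $3 = file with suffixes
suffix_comm() {
    # Sort the suffix file into a temporary copy instead of overwriting it.
    tmp=$(mktemp) || return 1
    sort "$3" > "$tmp"
    # Fields 5 onward of each line, sorted, compared against the sorted suffixes.
    cut -d '.' -f 5- "$2" | sort | comm "$1" - "$tmp"
    rm -f "$tmp"
}

# Usage:
#   suffix_comm -12 test1.txt test2.txt   # suffixes present in both
#   suffix_comm -23 test1.txt test2.txt   # suffixes only in test1.txt
#   suffix_comm -13 test1.txt test2.txt   # suffixes only in test2.txt
```

As with the original commands, the output contains only the trailing fields, so mapping them back to full lines of test1.txt would need an extra step.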

Upvotes: 0

Arjun Mathew Dan

Reputation: 5298

File1 :
abc.cde.ccd.eed.12345.5678.txt
abcd.cdde.ccdd.eaed.12346.5688.txt
aabc.cade.cacd.eaed.13345.5078.txt
abzc.cdae.ccda.eaed.29345.1678.txt
abac.cdae.cacd.eead.18145.2678.txt
aabc.cdve.cncd.ened.19945.2345.txt


File2 :
12345.5678.txt
29345.1678.txt
18145.2678.txt
10111.2222.txt



#!/bin/bash

# Remove output files left over from a previous run.
rm -f Both.txt File1.txt File2.txt

while IFS= read -r f2line
do
  found=0
  while IFS= read -r f1line
  do
    # -F treats the File2 entry as a fixed string, so '.' is not a wildcard.
    if Both=$(printf '%s\n' "$f1line" | grep -F -e "$f2line")
    then
      found=1
      printf '%s\n' "$Both" >> Both.txt
    fi
  done < File1
  if [ "$found" -eq 0 ]
  then
    printf '%s\n' "$f2line" >> File2.txt
  fi
done < File2

# Lines of File1 that never matched: keep only column 1 of comm.
touch Both.txt   # make sure the file exists even when nothing matched
sort Both.txt > s_Both.txt
sort File1 > s_File1
comm -23 s_File1 s_Both.txt > File1.txt
rm s_File1 s_Both.txt

Output Files: Both.txt, File1.txt, File2.txt

Upvotes: 0

Cyrus

Reputation: 88646

In both:

grep -f test2.txt test1.txt

Output:

abc.cde.ccd.eed.12345.5678.txt
abzc.cdae.ccda.eaed.29345.1678.txt
abac.cdae.cacd.eead.18145.2678.txt


Only in test1.txt:

grep -v -f test2.txt test1.txt

Output:

abcd.cdde.ccdd.eaed.12346.5688.txt
aabc.cade.cacd.eaed.13345.5078.txt
aabc.cdve.cncd.ened.19945.2345.txt


Only in test2.txt:

grep -v -f <(grep -Eo '[0-9]+\.[0-9]+\.txt$' test1.txt) test2.txt

Output:

10111.2222.txt
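The three commands can be combined into one report. The function below is only a sketch along the same lines, with two assumptions of its own: it adds -F so that the entries of test2.txt are treated as fixed strings rather than regular expressions, and it expects test2.txt to contain no blank lines (an empty fixed string matches every line). The name report_matches is hypothetical.

```shell
# Hypothetical wrapper around the grep commands above.
# $1 = file with full names (test1.txt), $2 = file with suffixes (test2.txt)
report_matches() {
    printf 'In both:\n'
    grep -F -f "$2" "$1"

    printf '\nOnly in %s:\n' "$1"
    grep -v -F -f "$2" "$1"

    printf '\nOnly in %s:\n' "$2"
    # Print each entry of $2 that is not a substring of any line in $1.
    while IFS= read -r entry || [ -n "$entry" ]; do
        grep -q -F -e "$entry" "$1" || printf '%s\n' "$entry"
    done < "$2"
}

# Usage:
#   report_matches test1.txt test2.txt
```

The last section uses a loop instead of extracting a numeric pattern from test1.txt, so it does not depend on the entries looking like digits.digits.txt, at the cost of one grep call per entry.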

Upvotes: 3
