user2020202

Reputation: 11

Remove from one file what is in another file

I have two text files, file1.txt and file2.txt.

file1.txt contains a list of numbers. file2.txt also contains a list of numbers, but more of them (a good chunk are numbers from file1.txt). This is what I am trying to do:

I want to remove all the numbers in file1.txt from file2.txt and have the output saved to file3.txt, so that file3.txt contains no numbers from file1.txt. How can I accomplish this?

Upvotes: 1

Views: 2996

Answers (6)

Thor

Reputation: 47099

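You want to print only the lines that appear in file2.txt but not in file1.txt. The comm utility is designed for this: with -13 it suppresses the lines unique to file1.txt (column 1) and the lines common to both files (column 3). Both inputs must be sorted: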

comm -13 <(sort file1.txt) <(sort file2.txt)

Testing

$ cat file1.txt
5
4
6
2
10

$ cat file2.txt
3
7
8
2
4
1
9
10
5
6

$ comm -13 <(sort file1.txt) <(sort file2.txt)
1
3
7
8
9
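
The `<( ... )` process substitution above is a bash/ksh/zsh feature. As a rough sketch for a plain POSIX shell, the same test can be run with intermediate sorted files; the scratch directory and the `*.sorted` file names are assumptions made for illustration, and the final `sort -n` is only there to put the result back in numeric order:

```shell
# Work in a scratch directory so no existing files are overwritten.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Use one collation order for both sort and comm.
export LC_ALL=C

printf '%s\n' 5 4 6 2 10 > file1.txt
printf '%s\n' 3 7 8 2 4 1 9 10 5 6 > file2.txt

# comm requires both inputs to be sorted in the same order.
sort file1.txt > file1.sorted
sort file2.txt > file2.sorted

# -1 drops lines found only in file1, -3 drops lines common to both,
# leaving the lines found only in file2.
comm -13 file1.sorted file2.sorted | sort -n > file3.txt

cat file3.txt
```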

Upvotes: 1

Jonathan Leffler

Reputation: 753615

With GNU grep, you can use the 'fgrep' mode:

grep -F -v -f file1.txt -w file2.txt > file3.txt

Demo:

seq 1 30 > file2.txt
for i in 1 2 3 4 5; do echo $RANDOM; done | sed 's/\(..\).*/\1/' > file1.txt
grep -F -v -f file1.txt -w file2.txt > file3.txt

file2.txt contains the numbers 1 through 30, one per line. file1.txt contains 5 semi-random 2-digit numbers. The output in file3.txt is the lines of file2.txt that are not in file1.txt. Note that the random numbers generated by the loop are not very good, nor are they constrained to 1..30 (see also comments just below).

The -w flag, which matches whole words, is not specified by POSIX; it is an extension found in GNU grep (and some other implementations). Interestingly, POSIX 2008 specifies that -x should match exact lines, and the -x option works correctly for me (on Mac OS X 10.7.5, but /usr/bin/grep is GNU grep 2.5.1). In theory, -x is more portable. Since it was in the POSIX 1997 standard too, it should be widely available. The -w option would be more appropriate if there were multiple numbers on a single line (though grep would still eliminate whole lines).
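
To make the difference between the matching modes concrete, here is a small sketch on made-up data (the sample numbers and the output file names substring.txt, word.txt and line.txt are assumptions for illustration, not part of the question):

```shell
# Work in a scratch directory so no existing files are overwritten.
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' 1 12 > file1.txt
printf '%s\n' 1 10 12 21 '12 34' 120 30 > file2.txt

# Neither -w nor -x: a pattern matches anywhere in the line,
# so "1" also removes 10, 21, 120, and so on. Only 30 survives.
grep -F -v -f file1.txt file2.txt > substring.txt

# -w: a line is removed if it contains a listed number as a whole word.
# Removes "1", "12" and "12 34"; keeps 10, 21, 120, 30.
grep -F -v -w -f file1.txt file2.txt > word.txt

# -x: a line is removed only if the entire line equals a listed number.
# Removes "1" and "12"; keeps 10, 21, "12 34", 120, 30.
grep -F -v -x -f file1.txt file2.txt > line.txt
```

For a file with exactly one number per line, -w and -x give the same result; they differ only when a line holds more than one number.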

Upvotes: 4

Steve

Reputation: 54392

Here's one way using awk:

awk 'FNR==NR { a[$0]; next } !($0 in a)' file1.txt file2.txt > file3.txt

This reads file1.txt into an array; then, while iterating through file2.txt, it prints the lines that are not in the array, and the redirection writes them to file3.txt. If you have any questions, don't hesitate to ask. Cheers.
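
As a quick illustration, here is the same one-liner run on sample data (the numbers and the scratch directory are assumptions for the example). Unlike the sort-based approaches, the output keeps the original order of file2.txt:

```shell
# Work in a scratch directory so no existing files are overwritten.
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' 5 4 6 2 10 > file1.txt
printf '%s\n' 3 7 8 2 4 1 9 10 5 6 > file2.txt

# FNR==NR is true only while the first file is being read:
#   a[$0] creates an array key for each line of file1.txt,
#   and next skips the rest of the program for that line.
# For file2.txt, !($0 in a) is true for lines never seen in file1.txt;
# a true pattern with no action prints the line.
awk 'FNR==NR { a[$0]; next } !($0 in a)' file1.txt file2.txt > file3.txt

cat file3.txt
# prints 3, 7, 8, 1, 9 (one per line)
```

One caveat: if file1.txt is empty, FNR==NR stays true while file2.txt is read, so nothing is printed.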

Upvotes: 6

Roguebantha

Reputation: 824

Can you give a little more information about how these numbers are formatted? Is each one on its own line? Do they all have the same number of digits?

EDIT: After receiving comment:

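# For each line of file2.txt, scan file1.txt for an identical line.
while IFS= read -r line
do
    found="false"
    while IFS= read -r secLine
    do
        if [ "$line" = "$secLine" ]
        then
            found="true"
            break
        fi
    done < file1.txt
    # Keep the line only if it was not found in file1.txt.
    if [ "$found" = "false" ]
    then
        printf '%s\n' "$line"
    fi
done < file2.txt > file3.txt

That should work, albeit by brute force: every line of the second file is compared against every line of the first, so it may take a while if you have many numbers. (Check for syntax errors; I didn't see any, but there may be some.)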

Upvotes: 0

Manjula

Reputation: 5091

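You can use the Unix "diff" command to get the differences and filter out the unwanted lines, using the --changed-group-format and --unchanged-group-format options (GNU diff) to select the required data. Note that diff compares the files line by line in order, so this works only if both files list their numbers in the same order (for example, both sorted).

Each of these options accepts one of the following three values to select what is printed for that group:

  • '%<' prints the lines from FILE1

  • '%>' prints the lines from FILE2

  • '' (empty string) prints nothing for that group.

e.g.: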

diff --changed-group-format="%>" --unchanged-group-format="" file1.txt file2.txt > file3.txt
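
A sketch of this approach on sample data, assuming GNU diff and sorting both files first so that the in-order comparison lines up (the sample numbers and the `*.sorted` file names are assumptions for illustration):

```shell
# Work in a scratch directory so no existing files are overwritten.
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' 5 4 6 2 10 > file1.txt
printf '%s\n' 3 7 8 2 4 1 9 10 5 6 > file2.txt

# diff is order-sensitive, so give both files the same ordering first.
sort file1.txt > file1.sorted
sort file2.txt > file2.sorted

# Print only the file2 side of each changed group and nothing for unchanged
# groups. diff exits with status 1 when the files differ, which is expected
# here; only a status of 2 or more signals a real error.
diff --changed-group-format='%>' --unchanged-group-format='' \
    file1.sorted file2.sorted > file3.txt || [ $? -eq 1 ]

cat file3.txt
```

The result comes out in sorted order rather than in the original order of file2.txt.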

Upvotes: 1

Stephen Niedzielski

Reputation: 2637

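# file1.txt is listed twice so its lines are never unique (uniq -u also drops lines repeated within file2.txt)
sort file1.txt file1.txt file2.txt | uniq -u > file3.txt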

Upvotes: 1
