Remove duplicate lines from multiple text files

I have 100 .txt files with about 1 million lines each.

Is there a way to read all the files, remove the lines that are duplicated across them, and save the remaining lines back to their respective files (using PHP, Unix tools, etc.)?

For example:

file1.txt contents

Something here1
Something here2

file2.txt contents

Something here2
Something here3

After removal:

file1.txt contents

Something here1
Something here2

file2.txt contents

Something here3

Upvotes: 2

Answers (3)

Gowri

Reputation: 16835

Read each file's lines into an array:

// file() with FILE_IGNORE_NEW_LINES returns the lines without their trailing newlines
$file1 = file('file1.txt', FILE_IGNORE_NEW_LINES);
$file2 = file('file2.txt', FILE_IGNORE_NEW_LINES);

Use array_unique to remove the duplicates within each file:

$f1 = array_unique($file1); 
$f2 = array_unique($file2);

Remove from the second array every line that already appears in the first:

$new_f2 = array_diff($f2,$f1);

Now $f1 and $new_f2 contain only unique values.

Write them back to the files.

Note: for more than two files, repeat this in a loop, comparing each subsequent file against all the lines kept so far.
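
The two-file steps above extend to any number of files by remembering every line that has already been kept. Here is a rough sketch of that idea, written with shell and awk rather than PHP and not part of the original answer. The function name dedup_across_files and the .dedup output suffix are made up for illustration, and the awk array holds every distinct line in memory, which may be a lot for 100 files of about 1 million lines each.

```shell
#!/bin/sh
# Sketch: keep only the first occurrence of each line across all files given
# as arguments, preserving line order. Each input FILE produces FILE.dedup
# (a hypothetical naming choice); the originals are left untouched.
dedup_across_files() {
    # Pre-create empty outputs so a file whose lines are all duplicates
    # still gets an (empty) output file.
    for f in "$@"; do
        : > "$f.dedup"
    done
    awk '
        # At the first line of each input, close the previous output and
        # switch to the new one, so only one output file is open at a time.
        FNR == 1 { if (out != "") close(out); out = FILENAME ".dedup" }
        # seen[$0]++ is 0 (false) only the first time a line is encountered.
        !seen[$0]++ { print > out }
    ' "$@"
}
```

Calling dedup_across_files file1.txt file2.txt on the example from the question would leave both lines in file1.txt.dedup and only "Something here3" in file2.txt.dedup.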

Upvotes: 0

brianmearns

Reputation: 9967

I tested this and it works. The order of lines within each file is not preserved, but you said in a comment that you're already sorting them, so that shouldn't matter. It's a little roundabout, but it does the job:

   #!/bin/bash

   #The number of files you have, named like file1.txt, file2.txt, etc.
   # If named otherwise, change the definition of variable "file" in the loop below.
   NUM_FILES=3

   #These files will be created and removed during the script, so make sure they're
   # not files you already have around.
   tempfile1="_all.txt"
   tempfile2="_tmp.txt"

   sort -u file1.txt > file1out.txt
   cat file1out.txt > $tempfile1

   for i in $(seq 2 $NUM_FILES)
   do
       prev=$((i-1))
       pofile="file${prev}out.txt"
       file="file$i.txt"
       ofile="file${i}out.txt"

       echo "Input files: $file $pofile"
       echo "Output file: $ofile"
       cat $tempfile1 $pofile > $tempfile2
       sort -u $tempfile2 > $tempfile1
       sort -u $file | comm -23 - $tempfile1 > $ofile
   done

   rm -f $tempfile1 $tempfile2
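
The step that does the real work in the loop is comm -23, which prints only the lines unique to its first input. A minimal sketch of that step in isolation follows. The function name lines_only_in_second is hypothetical, and forcing LC_ALL=C is an assumption added here so that sort and comm agree on ordering, since comm expects both inputs sorted in the same collation.

```shell
#!/bin/sh
# Sketch: print the lines of file $2 that do not occur in file $1.
lines_only_in_second() {
    seen_sorted=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$seen_sorted"
    # -2 suppresses lines found only in the second input ($seen_sorted),
    # -3 suppresses lines common to both, so only lines found solely in
    # the first input (stdin, the sorted new file) are printed.
    LC_ALL=C sort -u "$2" | LC_ALL=C comm -23 - "$seen_sorted"
    rm -f "$seen_sorted"
}
```

For the example in the question, lines_only_in_second file1.txt file2.txt would print just "Something here3".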

Upvotes: 0

anubhava

Reputation: 785196

Using Unix sort, awk and grep:

If order of lines doesn't matter:

 sort -u file1.txt > _temp && mv _temp file1.txt

If order of lines matters:

 awk 'FNR==NR{a[$0];next} ($0 in a) {delete a[$0]; print}' file1.txt file1.txt > _temp && mv _temp file1.txt
 grep -vxFf file1.txt file2.txt > _temp && mv _temp file2.txt
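
Two details of the grep step are easy to trip over. By default grep -f treats each line of the pattern file as a regular expression and matches it anywhere in a line, so exact whole-line comparison needs -F (fixed strings) and -x (whole line). Also, grep exits with status 1 when it selects no lines, so a plain "&& mv" is skipped when every line of the second file is a duplicate. A sketch that handles both, with the hypothetical function name remove_lines_in:

```shell
#!/bin/sh
# Sketch: remove from file $2 every line that also appears, as a whole
# line, in file $1. The result is written back into $2.
remove_lines_in() {
    ref=$1
    target=$2
    out=$(mktemp) || return 1
    status=0
    # -v invert match, -x whole lines only, -F fixed strings, -f patterns from file
    grep -vxFf "$ref" "$target" > "$out" || status=$?
    # Exit status 1 only means "no lines selected"; anything above 1 is a real error.
    if [ "$status" -gt 1 ]; then
        rm -f "$out"
        return "$status"
    fi
    # Copy the content back instead of mv, so $target keeps its permissions.
    cat "$out" > "$target" && rm -f "$out"
}
```

With a reference file containing "a.c" and "foo", this removes only lines that are exactly "a.c" or "foo"; lines such as "abc" or "foo bar" are kept, whereas plain grep -v -f would drop them too.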

Upvotes: 1
