Reputation:
I have 100 .txt files with ~1 million lines each.
Is there a way to open all the files, remove the duplicates across them, and save the remaining lines back to the corresponding files (PHP/Unix, etc.)?
For example:
file1.txt contents
Something here1
Something here2
file2.txt contents
Something here2
Something here3
After removal:
file1.txt contents
Something here1
Something here2
file2.txt contents
Something here3
Upvotes: 2
Views: 2537
Reputation: 16835
$file1 = explode("\n", file_get_contents('file1.txt'));
$file2 = explode("\n", file_get_contents('file2.txt'));
$f1 = array_unique($file1);
$f2 = array_unique($file2);
$new_f2 = array_diff($f2, $f1);
Now $f1 and $new_f2 hold only unique values.
Now just update the files.
Note: for more than two files, apply this iteratively, diffing each file against the lines kept from all previous files (see the sketch below).
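A minimal sketch of that iteration, assuming the files are named file1.txt through file100.txt (adjust the loop to your actual file names); note that it keeps every distinct line in memory, which may be heavy for 100 files of ~1 million lines each:
<?php
// Sketch only: keep a set of every line kept so far and drop lines that
// already appeared in an earlier file.
$seen = [];                                   // line => true for lines already kept
for ($i = 1; $i <= 100; $i++) {
    $name  = "file$i.txt";                    // assumed naming scheme
    $lines = explode("\n", file_get_contents($name));
    $keep  = [];
    foreach ($lines as $line) {
        if (!isset($seen[$line])) {           // first occurrence across all files
            $seen[$line] = true;
            $keep[]      = $line;
        }
    }
    file_put_contents($name, implode("\n", $keep));
}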
Upvotes: 0
Reputation: 9967
I tested this, and it works. The order of lines is not maintained within each file, but you said in a comment that you're already applying sort, so that doesn't matter. It's a little roundabout, but it does work:
#!/bin/bash
#The number of files you have, named like file1.txt, file2.txt, etc.
# If named otherwise, change the definition of the variable "file" in the loop below.
NUM_FILES=3
#These files will be created and removed during the script, so make sure they're
# not files you already have around.
tempfile1="_all.txt"
tempfile2="_tmp.txt"
sort -u file1.txt > file1out.txt
cat file1out.txt > $tempfile1
for i in $(seq 2 $NUM_FILES)
do
    prev=$((i-1))
    pofile="file${prev}out.txt"
    file="file$i.txt"
    ofile="file${i}out.txt"
    echo "Input files: $file $pofile"
    echo "Output file: $ofile"
    cat $tempfile1 $pofile > $tempfile2
    sort -u $tempfile2 > $tempfile1
    sort -u $file | comm -23 - $tempfile1 > $ofile
done
rm -f $tempfile1 $tempfile2
Upvotes: 0
Reputation: 785196
Using Unix sort, awk & grep:
If order of lines doesn't matter:
sort -u file1.txt > _temp && mv _temp file1.txt
If order of lines matters:
awk 'FNR==NR{a[$0];next} ($0 in a) {delete a[$0]; print}' file1.txt file1.txt > _temp && mv _temp file1.txt
Then remove file1's lines from file2:
grep -v -f file1.txt file2.txt > _temp && mv _temp file2.txt
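To cover all 100 files, the same idea can be looped. A rough sketch, assuming the files are named file1.txt through file100.txt and that _seen.txt and _temp are free scratch names; grep -F -x is used so lines are matched as complete literal strings, and the growing _seen.txt pattern file makes later iterations slower:
#!/bin/bash
# Keep only the first occurrence of each line across file1.txt .. file100.txt.
# Line order within each file is not preserved (sort -u is applied to each file).
sort -u file1.txt > _temp && mv _temp file1.txt
cp file1.txt _seen.txt
for i in $(seq 2 100); do
    f="file$i.txt"
    sort -u "$f" | grep -v -x -F -f _seen.txt > _temp   # drop lines seen in earlier files
    mv _temp "$f"
    cat "$f" >> _seen.txt                                # remember this file's surviving lines
done
rm -f _seen.txt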
Upvotes: 1