Reputation: 1979
I have a big file with numbers, for example:
cat $file
3120987654
3106982658
3420787642
3210957659
3320987654
3520987654
.
.
.
Every day I extract some numbers from the big file and save that day's numbers in a second file. Each day new numbers are added to the source data in my big file. I need to make a filter for the extraction job that ensures I do not extract numbers I have already extracted. How might I do this as a bash or python script?
Note: I cannot remove the numbers from the source data ("big file"); it needs to remain intact, because when I finish extracting numbers from the file, I need the original plus the updated data for the next day's job. If I create a copy of the file and remove the numbers from the copy, the new numbers that are added afterwards are not taken into account.
Upvotes: 1
Views: 233
Reputation: 6378
You can save sorted versions of your source file and your extracted data to temporary files, and then use a standard POSIX tool like comm to show the common lines/records. Those lines/records would be the basis of the "filter" you'd use in your subsequent extract jobs. If you are extracting records from the source.txt file with $SHELL commands, then something like grep -v [list of common lines] would be part of your script, along with whatever other criteria you are using for extracting the records. For reliable results the source.txt and extracted.txt files should be sorted, since comm expects sorted input.
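For example, sorted working copies could be created up front (the file names are just placeholders; the example below uses the original files directly for brevity):
% sort source.txt > source.sorted.txt
% sort extracted.txt > extracted.sorted.txt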
Here's a quick cut and paste of typical comm output. The sequence shows the "Big File", the extracted data, and then the final comm command, which shows lines unique to the source.txt file (see man comm(1) for how comm works). Following that is an example of searching with an arbitrary grep pattern while using the common lines as a "filter" to exclude them.
% cat source.txt
3120987654
3106982658
3420787642
3210957659
3320987654
3520987654
3520987754
3520987954
3520988654
3520987444
% cat extracted.txt
3120987654
3106982658
3420787642
3210957659
3320987654
% comm -2 -3 source.txt extracted.txt # show lines only in source.txt
3520987754
3520987954
3520988654
3520987444
comm selects or rejects lines common to two files; the utility conforms to IEEE Std 1003.2-1992 (“POSIX.2”). We can save its output for use with grep:
% comm -1 -2 source.txt extracted.txt | sort > common.txt
% grep -v -f common.txt source.txt | grep -E ".*444$"
This would grep the source.txt file and exclude the lines common to source.txt and extracted.txt; the pipe (|) then passes these "filtered" results to a second grep that searches for a new record to extract (in this case a line or lines ending in "444"). If the files are very large, or if you want to preserve the order of the numbers in the original file and the extracted data, then the question is more complex and the response will need to be more elaborate.
See my other response for the start of a simplistic alternative approach that uses perl.
Upvotes: 1
Reputation: 6378
Lazyish perl approach. Just write your own selection() subroutine to replace grep {/.*444$/} ;-)
#!/usr/bin/env perl
use strict; use warnings; use autodie;
use 5.16.0 ;
use Tie::File;
use Array::Utils qw(:all);
tie my @source, 'Tie::File', 'source.txt' ;
tie my @extracted, 'Tie::File', 'extracted.txt' ;
# Find the intersection
my @common = intersect(@source, @extracted);
say "Numbers already extracted";
say for @common;
untie @source;
untie @extracted;
Once the source.txt file has been updated you could select from it:
#!/usr/bin/env perl
use strict; use warnings; use autodie;
use 5.16.0 ;
use Tie::File;
use Array::Utils qw(:all);
tie my @source, 'Tie::File', 'source.txt' ;
tie my @extracted, 'Tie::File', 'extracted.txt' ;
# Find the intersection
my @common = intersect(@source, @extracted);
# Select from source.txt excluding numbers already selected:
my @newselect = array_minus(@source, @common);
say "new selection:";
# grep returns a list; the parentheses around $selection give list context,
# so $selection gets the first matching line.
my ($selection) = grep {/.*444$/} @newselect;
push @extracted, $selection if defined $selection;
say "updated extracted.txt" ;
untie @source;
untie @extracted;
This uses two modules ... succinct and idiomatic versions welcome!
Upvotes: 0
Reputation: 5864
I think you're not asking for unique values; rather, you want all the new values added since the last time you looked at the file?
Assume the BigFile gets new data all the time. We want DailyFilemm_dd_yyyy to contain the new numbers received during the previous 24 hours.
This script will do what you want. Run it each day.
#!/bin/bash
BigFile=bigfile
DailyFile=dailyfile
today=$(date +"%m_%d_%Y")
# Get the month, day, year for yesterday.
yesterday=$(date -jf "%s" $(($(date +"%s") - 86400)) +"%m_%d_%Y")
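# Note: "date -jf" is BSD/macOS syntax; with GNU date the equivalent would be
# yesterday=$(date -d "yesterday" +"%m_%d_%Y")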
cp $BigFile $BigFile$today
comm -23 $BigFile $BigFile$yesterday > $DailyFile$today
rm $BigFile$yesterday
comm -23 shows the lines that appear only in the first file (and not in the second).
Example of comm:
#values added to big file
echo '111
222
333' > big
cp big yesterday
# New values added to big file over the day
echo '444
555' >> big
# Find out what values were added.
comm -23 big yesterday > today
cat today
444
555
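Since the script is meant to run once a day, one way to do that automatically is a cron entry; the script path below is just a placeholder:
# Example crontab entry (edit with "crontab -e"): run at five past midnight daily.
5 0 * * * /path/to/daily_extract.sh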
Upvotes: 0
Reputation: 1122222
Read in all numbers from the big file into a set, then test new numbers against that:
with open('bigfile.txt') as bigfile:
    existing_numbers = {n.strip() for n in bigfile}

# Reopen bigfile for appending so the existing numbers are kept.
with open('newfile.txt') as newfile, open('bigfile.txt', 'a') as bigfile:
    for number in newfile:
        number = number.strip()
        if number not in existing_numbers:
            bigfile.write(number + '\n')
This adds numbers not already in bigfile to the end, in as efficient a way as possible. If bigfile becomes too big for the above to run efficiently, you may need to use a database instead.
Upvotes: 2