Reputation: 1979
I have a big file with numbers, for example:
cat $file
3120987654
3106982658
3420787642
3210957659
3320987654
3520987654
.
.
.
Every day I extract some numbers from the big file and save that day's numbers in a second file. Each day new numbers are added to the source data in my big file. I need to make a filter for the extraction job that ensures I do not extract numbers I have already extracted. How might I do this as a bash or python script?
Note: I cannot remove the numbers from the source data ("big file"); it needs to remain intact, because when I finish extracting numbers from the file, I need the original plus the updated data for the next day's job. If I create a copy of the file and remove the numbers from the copy, the new numbers that are added afterwards are not taken into account.
Upvotes: 1
Views: 233
Reputation: 6378
You can save sorted versions of your source file and your extracted data to temporary files, and then use a standard POSIX tool like comm to show the common lines/records. Those lines/records would be the basis of the "filter" you'd use in your subsequent extract jobs. If you are extracting records from the source.txt file with $SHELL commands, then something like grep -v [list of common lines] would be part of your script, along with whatever other criteria you are using for extracting the records. For reliable results the source.txt and extracted.txt files should be sorted, since comm expects sorted input.
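For example, sorted working copies could be created up front (the file names are just placeholders; the example below uses the original files directly for brevity):
% sort source.txt > source.sorted.txt
% sort extracted.txt > extracted.sorted.txt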
Here's a quick cut and paste of typical comm output. The sequence shows the "Big File", the extracted data, and then the final comm command, which shows lines unique to the source.txt file (see man comm(1) for how comm works). Following that is an example of searching with an arbitrary grep pattern while using the common lines as a "filter" to exclude them.
% cat source.txt
3120987654
3106982658
3420787642
3210957659
3320987654
3520987654
3520987754
3520987954
3520988654
3520987444
% cat extracted.txt
3120987654
3106982658
3420787642
3210957659
3320987654
% comm -2 -3 source.txt extracted.txt # show lines only in source.txt
3520987754
3520987954
3520988654
3520987444
comm selects or rejects lines common to two files; the utility conforms to IEEE Std 1003.2-1992 (“POSIX.2”). We can save its output for use with grep:
% comm -1 -2 source.txt extracted.txt | sort > common.txt
% grep -v -f common.txt source.txt | grep -E ".*444$"
This would grep the source.txt file and exclude the lines common to source.txt and extracted.txt; the pipe (|) then passes these "filtered" results to a second grep that searches for a new record to extract (in this case a line or lines ending in "444"). If the files are very large, or if you want to preserve the order of the numbers in the original file and the extracted data, then the question is more complex and the response will need to be more elaborate.
See my other response for the start of a simplistic alternative approach that uses perl.
Upvotes: 1
Reputation: 6378
Lazyish perl approach. Just write your own selection() subroutine to replace grep {/.*444$/} ;-)
#!/usr/bin/env perl
use strict; use warnings; use autodie;
use 5.16.0 ;
use Tie::File;
use Array::Utils qw(:all);
tie my @source, 'Tie::File', 'source.txt' ;
tie my @extracted, 'Tie::File', 'extracted.txt' ;
# Find the intersection
my @common = intersect(@source, @extracted);
say "Numbers already extracted";
say for @common;
untie @source;
untie @extracted;
Once the source.txt file has been updated you could select from it:
#!/usr/bin/env perl
use strict; use warnings; use autodie;
use 5.16.0 ;
use Tie::File;
use Array::Utils qw(:all);
tie my @source, 'Tie::File', 'source.txt' ;
tie my @extracted, 'Tie::File', 'extracted.txt' ;
# Find the intersection
my @common = intersect(@source, @extracted);
# Select from source.txt excluding numbers already selected:
my @newselect = array_minus(@source, @common);
say "new selection:";
# grep returns a list; the parentheses around $selection give list context,
# so $selection gets the first matching line.
my ($selection) = grep {/.*444$/} @newselect;
push @extracted, $selection if defined $selection;
say "updated extracted.txt" ;
untie @source;
untie @extracted;
This uses two modules ... succinct and idiomatic versions welcome!
Upvotes: 0
Reputation: 5864
I think you're not asking for unique values; rather, you want all the new values added since the last time you looked at the file?
Assume the BigFile gets new data all the time. We want DailyFilemm_dd_yyyy to contain the new numbers received during the previous 24 hours.
This script will do what you want. Run it each day.
#!/bin/bash
BigFile=bigfile
DailyFile=dailyfile
today=$(date +"%m_%d_%Y")
# Get the month, day, year for yesterday.
yesterday=$(date -jf "%s" $(($(date +"%s") - 86400)) +"%m_%d_%Y")
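# Note: "date -jf" is BSD/macOS syntax; with GNU date the equivalent would be
# yesterday=$(date -d "yesterday" +"%m_%d_%Y")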
cp $BigFile $BigFile$today
comm -23 $BigFile $BigFile$yesterday > $DailyFile$today
rm $BigFile$yesterday
comm -23 shows the lines that appear only in the first file (and not in the second).
Example of comm:
#values added to big file
echo '111
222
333' > big
cp big yesterday
# New values added to big file over the day
echo '444
555' >> big
# Find out what values were added.
comm -23 big yesterday > today
cat today
444
555
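Since the script is meant to run once a day, one way to do that automatically is a cron entry; the script path below is just a placeholder:
# Example crontab entry (edit with "crontab -e"): run at five past midnight daily.
5 0 * * * /path/to/daily_extract.sh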
Upvotes: 0
Reputation: 1122222
Read in all numbers from the big file into a set, then test new numbers against that:
with open('bigfile.txt') as bigfile:
    existing_numbers = {n.strip() for n in bigfile}

# Reopen bigfile for appending so the existing numbers are kept.
with open('newfile.txt') as newfile, open('bigfile.txt', 'a') as bigfile:
    for number in newfile:
        number = number.strip()
        if number not in existing_numbers:
            bigfile.write(number + '\n')
This adds numbers not already in bigfile to the end, in as efficient a way as possible. If bigfile becomes too big for the above to run efficiently, you may need to use a database instead.
Upvotes: 2