Reputation: 185
I am using a visualizer to view atom probe data.
My output files contain 4 columns. Each line contains x, y, and z coordinates of the atom plus an intensity value that determines which atom it is. Each element in the system has an output file.
I have code that counts the lines in each output file and divides each count by the total to calculate the composition of the system. For example, if the output files contain 100 lines in total and my iron output file contains 85 of them, then 85% of the system is made up of iron atoms.
Now, I want to reduce the number of iron atoms so it’s easier to see the other atoms. How can I randomly remove 90% of the lines from the output file? I want to make something like this conditional:
if ($atom>80) { #such as iron being 85
    #randomly remove lines, perhaps with rand()
}
Upvotes: 1
Views: 176
Reputation: 139431
A thinning implementation that uses reservoir sampling to choose which lines to delete; fraction is the percentage of lines to remove:
#! /usr/bin/env perl

use strict;
use warnings;

use Fcntl qw/ SEEK_SET /;

die "Usage: $0 fraction file\n" .
    "  where 1 <= fraction <= 99\n"
  unless @ARGV == 2;

my($fraction,$path) = @ARGV;
die "$0: invalid fraction: $fraction"
  unless $fraction =~ /^[0-9]+$/ && $fraction >= 1 && $fraction <= 99;

open my $fh, "<", $path or die "$0: open $path: $!";

# first pass: count the lines
my $lines = 0;
++$lines while defined($_ = <$fh>);

# modified Algorithm R from Knuth's TAoCP Volume 2, pg. 144
# choose $samples line numbers ($fraction percent of the file) to delete
my $samples = int (($lines / 100) * $fraction);
my @delete = (1 .. $samples);
foreach my $t ($samples+1 .. $lines) {
  my $m = int(rand $t) + 1;                 # uniform in 1 .. $t
  $delete[$m - 1] = $t if $m <= $samples;   # slot $m now holds line number $t
}

# second pass: print every line that was not chosen
seek $fh, 0, SEEK_SET or die "$0: seek: $!";
my %delete = map +($_ => 1), @delete;
$. = 0;   # seek does not reset the line counter
while (<$fh>) {
  print unless delete $delete{$.};
}
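The sampling step can also be looked at in isolation. The following is only a minimal sketch of Algorithm R, assuming 0 <= $k <= $n; the name pick_line_numbers is made up for illustration and does not appear in the script above.

```perl
use strict;
use warnings;

# Pick $k distinct line numbers uniformly at random from 1 .. $n.
# Assumes 0 <= $k <= $n (hypothetical helper, not part of the script above).
sub pick_line_numbers {
    my ($k, $n) = @_;
    my @reservoir = (1 .. $k);              # start with the first $k lines
    for my $t ($k + 1 .. $n) {
        my $m = int(rand $t);               # uniform in 0 .. $t-1
        $reservoir[$m] = $t if $m < $k;     # line $t replaces a slot with probability $k/$t
    }
    return @reservoir;
}

my @chosen = sort { $a <=> $b } pick_line_numbers(3, 10);
print "would delete lines: @chosen\n";
```

Every line number ends up in the reservoir with the same probability, and the list of candidates never has to be shuffled or stored in full.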
Upvotes: 0
Reputation: 57600
The rand function produces a real value in the interval [0, 1). If we want a condition that returns true 90% of the time, we can write rand() < 0.9. As you only want to keep 10% of iron atoms:
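my $percentage = shift @ARGV;   # fraction of iron records to drop, e.g. 0.9

while (<>) {
    # this_record_is_iron() stands for whatever test identifies an iron record
    print unless this_record_is_iron() && rand() < $percentage;
}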
Then:
$ perl reduce_iron.pl 0.9 input-data >reduced-data
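Because each line is dropped independently, the number of surviving lines varies from run to run and is only about 10% on average. A rough sketch of that behaviour on in-memory data, assuming every record is an iron record; thin_lines and the sample records are made up for illustration:

```perl
use strict;
use warnings;

# Drop each line independently with probability $p (0 <= $p <= 1).
# Hypothetical helper for illustration only.
sub thin_lines {
    my ($p, @lines) = @_;
    return grep { rand() >= $p } @lines;    # keep a line when rand() is not below $p
}

my @records = map { "$_ 0 0 56\n" } 1 .. 1000;   # made-up x y z intensity records
my @kept    = thin_lines(0.9, @records);
printf "kept %d of %d lines\n", scalar @kept, scalar @records;
```

Running it several times should print counts scattered around 100 rather than exactly 100.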
If we want to delete exactly 90%, then I'd read in the whole file, make an array of indices that point to iron records, shuffle the index list, and delete all but the last 10%:
use List::Util qw/shuffle/;

my $percentage = shift @ARGV;   # fraction of iron records to delete, e.g. 0.9

my(@lines, @iron_idx);
while (<>) {
    push @lines, $_;
    push @iron_idx, $#lines if this_record_is_iron();
}

# round down explicitly so that a product below 1 deletes nothing
my $n_delete = int(@iron_idx * $percentage);
@iron_idx = (shuffle @iron_idx)[0 .. $n_delete - 1]; # keep indices to delete
$_ = "" for @lines[@iron_idx];
print @lines;
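Since the question describes one output file per element, the same idea can be combined with the composition check from the question. This is only a sketch under that assumption: composition, thin_exactly, the element names and the line counts are all made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw/shuffle sum/;

# Percentage share of each element, given element => line count pairs.
sub composition {
    my (%count) = @_;
    my $total = sum(values %count) or return;   # empty input or zero total: no shares
    return map { $_ => 100 * $count{$_} / $total } keys %count;
}

# Remove exactly int($fraction * N) randomly chosen lines, keeping the original order.
sub thin_exactly {
    my ($fraction, @lines) = @_;
    my $n_delete = int(@lines * $fraction);
    my %drop = map { $_ => 1 } (shuffle 0 .. $#lines)[0 .. $n_delete - 1];
    return @lines[ grep { !$drop{$_} } 0 .. $#lines ];
}

my %count = (Fe => 85, Cr => 10, Ni => 5);          # made-up line counts per file
my %share = composition(%count);
my @iron  = map { "$_ 0 0 56\n" } 1 .. $count{Fe};  # made-up iron records

# Thin only the element that dominates the system, as in the question's conditional.
@iron = thin_exactly(0.9, @iron) if $share{Fe} > 80;
printf "iron lines after thinning: %d\n", scalar @iron;
```

Unlike the rand() < $percentage filter, this removes the same number of lines on every run; only the choice of lines changes.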
Upvotes: 4