Izaak Williamson

Reputation: 185

How do I randomly remove a certain fraction of a file’s lines?

I am using a visualizer to view atom probe data.

My output files contain 4 columns. Each line contains x, y, and z coordinates of the atom plus an intensity value that determines which atom it is. Each element in the system has an output file.

I have code that counts the lines in each output file and divides by the total across all files to calculate the composition of the system. For example, if the output files contain 100 lines in total and my iron output file contains 85 of them, then 85% of the system is iron.

Now, I want to reduce the number of iron atoms so it’s easier to see the other atoms. How can I randomly remove 90% of the lines from the output file? I want to make something like this conditional:

if ($atom>80) {      #such as iron being 85
    #randomly remove lines, perhaps with rand()
}

Upvotes: 1

Views: 176

Answers (2)

Greg Bacon

Reputation: 139431

A thinning implementation that uses reservoir sampling:

#! /usr/bin/env perl

use strict;
use warnings;

use Fcntl qw/ SEEK_SET /;

die "Usage: $0 fraction file\n" .
    "  where 1 <= fraction <= 99\n"
  unless @ARGV == 2;

my($fraction,$path) = @ARGV;
die "$0: invalid fraction: $fraction"
  unless $fraction =~ /^[0-9]+$/ && $fraction >= 1 && $fraction <= 99;

open my $fh, "<", $path or die "$0: open $path: $!";
my $lines = 0;  # start at 0 so an empty file does not leave $lines undefined
++$lines while defined($_ = <$fh>);

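# Algorithm R from Knuth's TAoCP Volume 2, pg. 144:
# @delete is the reservoir of line numbers chosen for deletion
my $samples = int (($lines / 100) * $fraction);
my @delete = (1 .. $samples);
foreach my $t ($samples+1 .. $lines) {
  my $m = int(rand $t) + 1;
  # @delete is 0-indexed, so slot $m lives at index $m - 1;
  # the replacement is the current line number $t
  $delete[$m - 1] = $t if $m <= $samples;
}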

seek $fh, 0, SEEK_SET or die "$0: seek: $!";
my %delete = map +($_ => 1), @delete;
$. = 0;  # seek does not reset the line counter; the first read makes it 1
while (<$fh>) {
  print unless delete $delete{$.};
}

Upvotes: 0

amon

Reputation: 57600

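The rand function produces a real value in the interval [0, 1). If we want a condition that is true 90% of the time, we can write rand() < 0.9. Since you want to keep only about 10% of the iron atoms, skip each iron record with that probability (despite its name, $percentage holds a fraction such as 0.9):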

my $percentage = shift @ARGV;
while (<>) {
  print unless this_record_is_iron() && rand() < $percentage;
}

Then:

$ perl reduce_iron.pl 0.9 input-data >reduced-data
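
In the question each element has its own output file, so when the script is run on the iron file alone the this_record_is_iron() check can be dropped. A minimal sketch under that assumption; thin_lines is a hypothetical helper name and the sample records are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: drop each line independently with probability
# $drop_fraction, so the number of lines kept is only approximately
# (1 - $drop_fraction) of the input.
sub thin_lines {
    my ($drop_fraction, @lines) = @_;
    return grep { rand() >= $drop_fraction } @lines;
}

# Made-up records in the "x y z intensity" layout from the question.
my @lines = map { "$_ 0 0 1\n" } 1 .. 1000;
my @kept  = thin_lines(0.9, @lines);
printf "kept %d of %d lines\n", scalar @kept, scalar @lines;
```

The count printed varies from run to run, because every line is decided by its own call to rand.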

If we want to delete exactly 90%, then I'd read in the whole file, make an array of indices that point to iron records, shuffle the index list, and delete all but the last 10%:

use List::Util qw/shuffle/;
my $percentage = shift @ARGV;
my(@lines, @iron_idx);
while (<>) {
  push @lines, $_;
  push @iron_idx, $#lines if this_record_is_iron();
}
@iron_idx = (shuffle @iron_idx)[0 .. int(@iron_idx * $percentage) - 1]; # indices of the records to delete
$_ = "" for @lines[@iron_idx];
print @lines;
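
The same idea can be sketched for a file that holds only iron records, as in the question's one-file-per-element layout. This assumes the whole file fits in memory; thin_exactly is a hypothetical helper name and the sample records are made up:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: drop exactly int(n * $drop_fraction) of the n lines,
# chosen uniformly at random, and return the rest in their original order.
sub thin_exactly {
    my ($drop_fraction, @lines) = @_;
    my $n_drop = int(@lines * $drop_fraction);
    my %drop;
    # Shuffle the indices and mark the first $n_drop of them for deletion.
    @drop{ (shuffle 0 .. $#lines)[0 .. $n_drop - 1] } = ();
    return @lines[ grep { !exists $drop{$_} } 0 .. $#lines ];
}

# Made-up records in the "x y z intensity" layout from the question.
my @lines = map { "$_ 0 0 1\n" } 1 .. 1000;
my @kept  = thin_exactly(0.9, @lines);
printf "kept %d of %d lines\n", scalar @kept, scalar @lines;
```

Unlike the rand() < $percentage test, the number of lines removed here is the same on every run; only which lines are removed changes.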

Upvotes: 4
