Reputation: 55

How to remove redundant fields and merge resulting lines

I am attempting to process a plain text file. It is basically an index of names and associated number fields formatted like so:

Nowosielski, Matthew, 484, 584, 777
Nowosielski, Matthew, 1151
Nunes, Paulino, 116
Nussbaum, Mike, 1221, 444,
Nussbaum, Mike, 156

I would like to process it into this:

Nowosielski, Matthew, 484, 584, 777, 1151
Nunes, Paulino, 116
Nussbaum, Mike, 156, 444, 1221

As you can see, the lines do not end consistently: some end with whitespace, some with a newline and some with a comma. Effectively, I need to merge lines that begin with the same full name, discarding the redundant name entries and keeping the merged number fields in numerical order.

My gut tells me to learn either some quick Perl or awk, but I have no experience with either. I looked into both, and after some searching and reading I haven't been able to find a clear or clean path to a solution.

My question thus is: what would be the best tool for the job that I might learn efficiently and just enough to complete this task? Also, given the suggested tool, are there any suggestions on how to approach the problem?

I can just edit this file by hand, of course, but that's not very interesting and seems to be a very stupid, ham-fisted approach to the problem. I'm taking this task as an excuse to learn a bit about text processing as it feels like a problem for which there's probably a good, existing tool.

Any pointers?

Upvotes: 3

Answers (4)

cryptochaos

Reputation: 41

Try using AWK

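#!/usr/bin/awk -f
# Assumes lines for the same name are adjacent; $1 is "Last," and $2 is "First,".
{ sub(/[, \t]+$/, "") }    # strip trailing commas and whitespace
# Same name as the previous line: append only the numbers.
NR > 1 && $1 == lastOne && $2 == lastTwo { $1 = ""; $2 = ""; sub(/^ +/, ""); printf ", %s", $0; next }
# New name: remember it and start a new output line.
{ printf "%s%s", (NR > 1 ? "\n" : ""), $0; lastOne = $1; lastTwo = $2 }
END { if (NR > 0) printf "\n" }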

This script assumes the data is sorted by the first two fields, so that lines for the same name are adjacent.
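For comparison, here is a minimal Python sketch of the same adjacent-line idea. It is not part of the original answer: the function names `split_fields` and `merge_adjacent` are made up for illustration, and it assumes, like the awk script, that lines for the same name are already next to each other. The numbers are kept in the order they appear, not sorted.

```python
from itertools import groupby


def split_fields(line):
    """Split a line on commas, dropping empty fields left by trailing commas."""
    return [f.strip() for f in line.split(",") if f.strip()]


def merge_adjacent(lines):
    """Merge runs of consecutive lines that start with the same two fields."""
    rows = [split_fields(line) for line in lines if line.strip()]
    merged = []
    # groupby only groups *adjacent* equal keys, hence the sorted-input assumption.
    for name, run in groupby(rows, key=lambda fields: tuple(fields[:2])):
        numbers = [f for fields in run for f in fields[2:]]
        merged.append(", ".join([*name, *numbers]))
    return merged


sample = [
    "Nowosielski, Matthew, 484, 584, 777",
    "Nowosielski, Matthew, 1151",
    "Nunes, Paulino, 116",
    "Nussbaum, Mike, 1221, 444,",
    "Nussbaum, Mike, 156",
]
print("\n".join(merge_adjacent(sample)))
```

If the input is not grouped by name, the same name shows up on more than one output line, which is why sorting the file first matters for this approach.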

Upvotes: 0

Jonathan Leffler

Reputation: 754400

To do this cleanly, you need a language with associative arrays (Perl - hashes; Python - dictionaries; Awk - associative arrays). That rules out sed (and C).

In awk:

awk '{ key = $1 " " $2    # plain string key; names[$1, $2] would embed the unprintable SUBSEP
       for (i = 3; i <= NF; i++) names[key] = names[key] " " $i }
     END { for (name in names) { printf "%s: %s\n", name, names[name]; } }'

You might prefer to specify the comma as a field delimiter with '-F,'.

The extra requirements - sort numbers in order and handle middle names - are much fiddlier to handle in awk than Perl; with the extra requirements, I'd go with Perl rather than awk. (Note that GNU Awk has built-in functions asort and asorti to sort arrays, but I'm not sure you can have 'names[$1,$2]' identifying an array of integers in awk.) I'm much more fluent in Perl than Python - but Python could undoubtedly do what Perl handles too.

Upvotes: 1

Pedro Silva

Reputation: 4700

As Brian said, use a hash table. The following removes newlines, splits each record on commas, uses the "last name, first name" original form as a key to a hash, pushes the remaining values into an array and uses a reference to said array as the value to the above key.

Then it just iterates over the key/value pairs in the hash and formats accordingly.

Amended solution - sorting numbers, omitting middle names, and sorting output

#!/usr/bin/env perl
use strict;
use warnings;

my %merged;

while (my $record = <DATA>) {
    chomp $record;
    my ($lname, $fname, @stuff) = split /[, ]+/, $record;
    push @{ $merged{"$lname, $fname"} }, grep { m/^\d+$/; } @stuff;
}

foreach my $name (sort keys %merged) {
    print $name, ", ", join( ', ', sort { $a <=> $b } @{$merged{$name}}), "\n";
}

__DATA__
Nowosielski, Matthew, 484, 584, 777
Nowosielski, Matthew, 1151
Nunes, Paulino, 116
Nussbaum, Mike, 1221, 444,
Nussbaum, Mike, 156
Nowosielski, Matthew, Kimball, 485, 684, 277

Amended output

Nowosielski, Matthew, 277, 484, 485, 584, 684, 777, 1151
Nunes, Paulino, 116
Nussbaum, Mike, 156, 444, 1221

Original solution

#!/usr/bin/env perl
use strict;
use warnings;

my %merged;

while (my $record = <DATA>) {
    chomp $record;
    # Split on commas and drop the whitespace around them.
    my ($lname, $fname, @stuff) = split /\s*,\s*/, $record;

    push @{ $merged{"$lname, $fname"} }, @stuff;
}

while (my ($name, $stuff) = each %merged) {
    print join( ', ', $name, @$stuff ), "\n";
}

__DATA__
Nowosielski, Matthew, 484, 584, 777
Nowosielski, Matthew, 1151
Nunes, Paulino, 116
Nussbaum, Mike, 1221, 444,
Nussbaum, Mike, 156

Upvotes: 4

Brian Clements

Reputation: 3905

Seeing this as an excuse to learn, I would write a quick Python script.

Make yourself a dictionary (map) with strings as keys and values. Read in a line and grab the name. Look up the name in the dictionary. If it is in there, append the new numbers to the end of the dictionary entry. When you've read the whole file, iterate through the dictionary and print out the keys and values.
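The steps above could be sketched roughly as follows. This is only one way to do it, under a few assumptions that are not in the answer: the function name `merge_index` is made up, the dictionary values are lists of integers rather than strings (so the numbers can be sorted numerically), and any non-numeric field after the first two, such as a middle name, is simply dropped.

```python
import re
from collections import defaultdict


def merge_index(lines):
    """Merge lines sharing "Last, First" and sort each name's numbers."""
    merged = defaultdict(list)
    for line in lines:
        # Split on commas/whitespace; trailing commas yield empty strings.
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        last, first, *rest = fields
        # Keep only purely numeric fields (this drops e.g. a middle name).
        merged[f"{last}, {first}"].extend(int(f) for f in rest if f.isdecimal())
    return [
        ", ".join([name, *map(str, sorted(numbers))])
        for name, numbers in sorted(merged.items())
    ]


sample = """\
Nowosielski, Matthew, 484, 584, 777
Nowosielski, Matthew, 1151
Nunes, Paulino, 116
Nussbaum, Mike, 1221, 444,
Nussbaum, Mike, 156
"""

for row in merge_index(sample.splitlines()):
    print(row)
```

To run it against a real file, something like `merge_index(open("index.txt"))` should work, where `index.txt` is a placeholder file name. Because the dictionary collects everything before printing, the input does not need to be sorted.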

Upvotes: 2
