TheBlackCorsair

Reputation: 527

merge multiple CSV files perl

How can I merge multiple CSV files in Perl?

For example, I have a first file, Packet1.csv, that looks like this:

#type, number, info, availability
computer, t.100, pentium 2, yes
computer, t.1000, pentium 3, yes
computer, t.2000, pentium 4, no
computer, t.3000, pentium 5, yes

and a second file, Packet2.csv, that looks like this:

#type, number, info, availability
computer, t.100, pentium 2, yes
computer, t.1000, pentium 3, no
computer, t.2000, pentium 4, no
computer, t.4000, pentium 6, no

The output I want is a single file like the one below. The number of Packet files is not fixed:

#type, number, info, **Packet1** availability, **Packet2** availability
computer, t.100, pentium 2, yes, yes
computer, t.1000, pentium 3, yes, no
computer, t.2000, pentium 4, no, no
computer, t.3000, pentium 5, yes
computer, t.4000, pentium 6, no

Upvotes: 0

Views: 784

Answers (2)

David W.

Reputation: 107090

  • How do you identify which computer is which? Do you depend upon the first three fields as the computer identification?
  • What if the first field isn't computer?
  • What happens if the two files disagree with the computer type?

You really have to answer these questions before you can figure out how to handle this. However, you're probably going to have to deal with references.

I think your question comes down to the fact that each element of a standard Perl hash or array holds only a single scalar value. You can have a hash of single values or an array of single values, but you can't store multiple values directly in one element. Perl gets around this with references: a reference is itself a single scalar, but it can point to another hash or array.

For example, let's say you have a hash called %computer that is keyed by that second field:

my %computer;

# Keys such as t.100 contain a dot, so they must be quoted.
$computer{'t.100'} = {};    # This is a hash of hashes
$computer{'t.100'}->{INFO} = "pentium 2";
$computer{'t.100'}->{TYPE} = "computer";
$computer{'t.100'}->{AVAILABILITY} = [];  # Storing an array in this hash entry (hash of hashes of arrays)
$computer{'t.100'}->{AVAILABILITY}->[0] = "yes";
$computer{'t.100'}->{AVAILABILITY}->[1] = "yes";

You could also use push and pop by dereferencing the array:

push @{ $computer{'t.100'}->{AVAILABILITY} }, "yes";

Note that I surrounded the reference to the array, $computer{'t.100'}->{AVAILABILITY}, with @{...}. That dereferences it, turning it from a reference to an array back into an array.

I hope this is what you're asking. You could use the Text::CSV module to parse your CSV files, but if the format isn't too wacky (no quoted fields or embedded commas), you could probably just use the split function.
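As a rough illustration of the split approach, here is a minimal sketch that fills the hash of hashes of arrays described above. The sample lines are copied from the question and held in memory rather than read from disk, and keying on the second field is an assumption that depends on how you answer the questions at the top.

```perl
use strict;
use warnings;

# Sample lines from the question, one array per packet file (held in memory
# here; in real use they would be read from Packet1.csv, Packet2.csv, ...).
my @packets = (
    [ "computer, t.100, pentium 2, yes", "computer, t.1000, pentium 3, yes" ],
    [ "computer, t.100, pentium 2, yes", "computer, t.1000, pentium 3, no" ],
);

my %computer;
for my $lines (@packets) {
    for my $line (@$lines) {
        # Split on a comma followed by optional whitespace.
        my ($type, $number, $info, $availability) = split /,\s*/, $line;

        # Assumption: the second field identifies the computer.
        $computer{$number}{TYPE} = $type;
        $computer{$number}{INFO} = $info;

        # Autovivification creates the array reference on the first push.
        push @{ $computer{$number}{AVAILABILITY} }, $availability;
    }
}

for my $number (sort keys %computer) {
    print join(", ", $number, @{ $computer{$number}{AVAILABILITY} }), "\n";
}
```

One caveat with a plain array: the positions only line up with the packet files if every ID appears in every file. If an ID is missing from one file, the later values shift left, so you would need to record which file each value came from.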

Upvotes: 0

MattLBeck

Reputation: 5841

Going back to your earlier attempt at multidimensional hashing (Hash of hashes perl), you will need to change the data structure you are using in order to store multiple entries for a particular element.

A CSV can be read intuitively into a hash with two levels. The rows of the CSV are hashed by their IDs (in this case I guess the IDs are the numbers 't.100', 't.1000', etc.), and the values of each row are stored in a second-level hash that uses the header strings as its keys. If you viewed the structure with Data::Dumper, it would look something like this:

$VAR1 = {
          't.1000' => {
                        'info' => 'pentium 3',
                        'availability' => 'yes',
                        'type' => 'computer'
                      },
          't.100' => {
                       'info' => 'pentium 2',
                       'availability' => 'yes',
                       'type' => 'computer'
                     }
        };

Whether 'number' is also a key for each 'row hash' is up to you depending on how useful that might be (usually you already know the key for the row in order to access it).
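A minimal sketch of reading one file into that two-level hash might look like the following. It reads from an in-memory string so it runs without any file on disk, it assumes the simple comma-separated format from the question (no quoted fields), and it leaves 'number' out of the row hashes, which is just one of the two options mentioned above.

```perl
use strict;
use warnings;
use Data::Dumper;

# Sample content copied from the start of Packet1.csv in the question.
my $csv = <<'END';
#type, number, info, availability
computer, t.100, pentium 2, yes
computer, t.1000, pentium 3, yes
END

# Read from an in-memory filehandle so the sketch needs no file on disk.
open my $fh, '<', \$csv or die "Cannot open in-memory file: $!";

# Header line: drop the leading '#' and split into column names.
my $header = <$fh>;
chomp $header;
$header =~ s/^#//;
my @columns = split /,\s*/, $header;

my %rows;
while (my $line = <$fh>) {
    chomp $line;
    next unless length $line;              # skip blank lines
    my %row;
    @row{@columns} = split /,\s*/, $line;  # hash slice: column name => value
    my $id = delete $row{number};          # the ID becomes the outer key
    $rows{$id} = \%row;
}
close $fh;

$Data::Dumper::Sortkeys = 1;               # stable key order in the dump
print Dumper(\%rows);
```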

This data structure is fine for storing one CSV file. However, we need an extra layer to cope with merging multiple CSVs in the way that you describe. For example, to keep track of the files that a particular ID appears in, we can store a third hash as the value of the 'availability' key, since that is the value that changes between entries with the same 'number':

'availability' => {
          'Packet1' => 'yes',
          'Packet2' => 'no'
        };

Once all files have been read into this structure, printing the final CSV out is then a process of looping over the keys of the outer hash and, for each row, 'joining' the row's keys in the correct order. The 'Packet' hash can also be looped over to retrieve all 'availability' values and these can be appended to the end of each row.
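Putting the pieces together, a sketch of the merge and the final print step could look like this. The rows are a subset of the question's data held in memory, and printing an empty field when a packet does not contain an ID is an assumption on my part, since the desired output in the question does not say which column a lone value belongs to. The `//` operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# A subset of the question's rows, keyed by packet name (held in memory here;
# in real use each list would be read from the corresponding CSV file).
my %files = (
    Packet1 => [ "computer, t.100, pentium 2, yes",
                 "computer, t.3000, pentium 5, yes" ],
    Packet2 => [ "computer, t.100, pentium 2, yes",
                 "computer, t.4000, pentium 6, no" ],
);
my @packets = sort keys %files;

# Build the three-level structure: ID => field => (value or packet => value).
my %rows;
for my $packet (@packets) {
    for my $line (@{ $files{$packet} }) {
        my ($type, $number, $info, $availability) = split /,\s*/, $line;
        $rows{$number}{type} = $type;
        $rows{$number}{info} = $info;
        $rows{$number}{availability}{$packet} = $availability;
    }
}

# Header row: one availability column per packet.
my @output = (
    join(", ", "#type", "number", "info", map("$_ availability", @packets)),
);

for my $number (sort keys %rows) {
    my $row = $rows{$number};
    # Assumption: a packet that lacks this ID contributes an empty field.
    my @availability = map { $row->{availability}{$_} // "" } @packets;
    push @output, join(", ", $row->{type}, $number, $row->{info}, @availability);
}

print "$_\n" for @output;
```

Note that `sort keys` orders the IDs as strings. That happens to work for the IDs in the question, but other ID formats may need a custom sort.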

I hope that helps you understand one possible way of dealing with this kind of data. You can ask about specific parts of the implementation if you are finding them difficult, and I will be happy to elaborate.

Upvotes: 3
