Reputation: 357
I have two very large XML files that have different kinds of line endings. File A has CR LF at the end of each XML record. File B has only CR at the end of each XML record.
In order to read File B properly, I need to set the built-in Perl variable $/ to "\r". But if I'm using the same script with File A, the script does not read each line in the file and instead reads it as a single line.
How can I make the script compatible with text files that use different line-ending delimiters? In the code below, the script reads XML data and uses a regex to split records on a specific closing tag such as </record>. Finally, it writes the requested records to a file.
open my $file_handle, '+<', $inputFile or die $!;
local $/ = "\r";
while (my $line = <$file_handle>) { # read the file line by line; does not load the whole file into memory
    $current_line = $line;
    if ($spliceAmount > $recordCounter) { # the splice amount hasn't been reached yet
        push(@setofRecords, $current_line); # add each line to the set of records array
        if ($current_line =~ m|$recordSeparator|) { # check for the node to splice on
            $recordCounter++; # record separator found (end of that record), so increment the counter
        }
    }
    # don't close the file because we need to read the last line
}
$current_line =~ /(\<\/\w+\>$)/;
$endTag = $1;
print "\n\n";
print "End Tag: $endTag \n\n";
close $file_handle;
Upvotes: 1
Views: 492
Reputation: 14699
While you may not need it for this, in theory you should use an XML parser to parse XML. I'd recommend XML::LibXML, or perhaps XML::Simple to start off with.
Upvotes: 1
Reputation: 118605
If the file isn't too big to hold in memory, you can slurp the whole thing into a scalar and split it into the correct lines yourself with a suitably flexible regular expression. For example,
local $/ = undef;
my $data = <$file_handle>;
my @lines = split /(?<=\n)|(?<=\r)(?!\n)/, $data;
foreach my $line (@lines) {
    ...
}
Using look-behind assertions (?<=...) preserves the end-of-line characters like the regular <> operator does; the (?!\n) keeps a CR LF pair together as a single terminator. If you were just going to chomp them anyway, you can save yourself a step by passing /\r\n|\r|\n/ to split instead.
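To see how splitting on mixed line endings behaves, here is a minimal, self-contained sketch. The sample string and the variable names are made up for illustration and are not from the question. It shows one pattern that drops the terminators and one that keeps them by splitting at zero-width positions found with look-behind.

```perl
use strict;
use warnings;

# Made-up sample mixing all three terminators: CRLF, lone CR, lone LF.
my $data = "<a>1</a>\r\n<b>2</b>\r<c>3</c>\n<d>4</d>";

# Variant 1: drop the terminators while splitting.
# "\r\n" is listed first so a CRLF pair is consumed as one terminator.
my @stripped = split /\r\n|\r|\n/, $data;

# Variant 2: keep each terminator attached to its line. The split points are
# zero-width: after any LF, or after a CR that is not followed by an LF.
my @kept = split /(?<=\n)|(?<=\r)(?!\n)/, $data;

print scalar(@stripped), " lines without terminators\n";
for my $line (@kept) {
    # Make the control characters visible before printing.
    (my $visible = $line) =~ s/\r/\\r/g;
    $visible =~ s/\n/\\n/g;
    print "[$visible]\n";
}
```

Both variants still need the whole file in one scalar, so they only suit files that fit in memory.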
Upvotes: 0