erogol

Reputation: 13614

Removing newline character from a string in Perl

I have a string read from a text file on Ubuntu Linux, and I am trying to remove the newline character from its end.

I have tried everything I could think of. With s/\n|\r/-/ (which I used to check whether any newline character is found and replaced), the substitution happens, but the output still moves to the next line when I print it. Moreover, when I use chomp or chop, the string is deleted completely. I could not find any other solution. How can I fix this problem?

use strict;
use warnings;
use v5.12;
use utf8;
use encoding "utf-8";

open(MYINPUTFILE, "<:encoding(UTF-8)", "file.txt");

my @strings;
my @fileNames;
my @erroredFileNames;

my $delimiter;
my $extensions;
my $id;
my $surname;
my $name;

while (<MYINPUTFILE>)
{
    my ($line) = $_;
    my ($line2) = $_;
    if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
        #chop($line2);
        $line2 =~ s/^\n+//;
        print $line2 . " WRONG FORMAT!\n";
    }
    else {
        #print "INSERTED:".$13."\n";
        my($id) = $13;
        my($name) = $2;
        print $name . "\t" . $id . "\n";
        unshift(@fileNames, $line2);
        unshift(@strings, $line2 =~ /[^\W_]+/g);
    }
}
close(MYINPUTFILE);

Upvotes: 7

Views: 76826

Answers (5)

Felippe Silvestre

Reputation: 89

$variable = join('', split(/\n/, $variable));

Upvotes: 3

GoldenNewby

Reputation: 4452

You can wipe the linebreaks with something like this:

$line =~ s/[\n\r]//g;

When you do that though, you'll need to change the regex in your if statement to not look for them. I also don't think you want a /g in your if. You really shouldn't have a $line2 either.

I also wouldn't do this type of thing:

print $line2." WRONG FORMAT!\n";

You can do

print "$line2 WRONG FORMAT!\n";

... instead. Also, print accepts a list, so instead of concatenating your strings, you can just use commas.

Upvotes: 7

TLP

Reputation: 67900

You are probably seeing problems caused by Windows line endings in the file. For example, a string you expect to be "foo bar\n" would actually be "foo bar\r\n". On Ubuntu, chomp removes whatever is contained in the variable $/, which is "\n" by default. So what remains is "foo bar\r".

This is a subtle, but very common error. For example, if you print "foo bar\r" and add a newline, you would not notice the error:

my $var = "foo bar\r\n";
chomp $var;
print "$var\n";  # Remove and put back newline

But when you concatenate the string with another string, the second string overwrites the first on screen, because \r moves the cursor back to the beginning of the line. For example:

print "$var: WRONG\n";

It would effectively be "foo bar\r: WRONG\n", and the \r would cause the following text to be printed on top of the first part:

foo bar\r           # \r resets position
 : WRONG\n          # Second line prints and overwrites

This is more obvious when the first line is longer than the second. For example, try the following:

perl -we 'print "foo bar\rbaz\n"'

And you will get the output:

baz bar

The solution is to remove the bad line endings. You can do this with the dos2unix command, or directly in Perl with:

$line =~ s/[\r\n]+$//;
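
The difference between chomp and the substitution can be sketched as follows. This is only an illustration, assuming a line that was read from a file with Windows (CRLF) line endings; the helper name strip_eol is hypothetical and not part of the original code:

```perl
use strict;
use warnings;

# Hypothetical helper: strip any trailing run of CR/LF characters.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/[\r\n]+$//;
    return $line;
}

my $raw = "foo bar\r\n";    # a line as read from a CRLF file

my $chomped = $raw;
chomp $chomped;             # removes only "\n" while $/ is "\n"

my $clean = strip_eol($raw);

printf "after chomp:     %d characters\n", length $chomped;
printf "after strip_eol: %d characters\n", length $clean;
```

Here chomp leaves the trailing "\r" in place (8 characters), while the substitution removes it as well (7 characters).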

Also, be aware that your other code is somewhat horrific. What, for example, do you think $13 contains? It would be the string captured by the 13th capture group in your previous regular expression. I'm fairly sure that value will always be undefined, because you do not have 13 capture groups.

You declare two sets of $id and $name: one at the top of the script and one inside the loop. This is very poor practice, IMO. Only declare variables within the scope where they are needed, and never just bunch all your declarations at the top of your script, unless you explicitly want them to be global to the file.

Why use $line and $line2 when they have the same value? Just use $line.

And seriously, what is up with this:

if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {

That looks like an attempt to obfuscate, no offence. Three nested negations and a bunch of unnecessary parentheses?

First off, since it is an if-else, just swap the two branches and use =~ instead of !~. Second, [^\W_], a double negation, is rather confusing. Why not just use [A-Za-z0-9]? You can split this up to make it easier to parse:

if ($line =~ /^(.+)(\.docx)\s*$/) {
    my $pre = $1;    # file name without the extension
    my $ext = $2;    # the ".docx" extension
    ...
}

Upvotes: 12

tchrist

Reputation: 80384

The correct way to remove Unicode linebreak graphemes, including CRLF pairs, is using the \R regex metacharacter, introduced in v5.10.

The use encoding pragma is strongly deprecated. You should either use the use open pragma, or use an encoding in the mode argument on 3-arg open, or use binmode.

 use v5.10;                     # minimal Perl version for \R support
 use utf8;                      # source is in UTF-8
 use warnings qw(FATAL utf8);   # encoding errors raise exceptions
 use open qw(:utf8 :std);       # default open mode, `backticks`, and std{in,out,err} are in UTF-8

 while (<>) {
     s/\R\z//;
     ...
 }
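
As a rough sketch of how \R behaves on different line endings, assuming the lines are already in memory rather than read from a file; the helper name strip_linebreak is hypothetical and only used for illustration:

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: remove one trailing linebreak of any kind.
sub strip_linebreak {
    my ($line) = @_;
    $line =~ s/\R\z//;    # \R treats "\r\n" as a single linebreak
    return $line;
}

for my $raw ("unix\n", "windows\r\n", "old mac\r", "no linebreak") {
    my $clean = strip_linebreak($raw);
    say "[$clean]";       # brackets make leftover whitespace visible
}
```

Because the pattern is anchored with \z, only the final linebreak is removed; a string ending in two newlines keeps the first one.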

Upvotes: 19

vol7ron

Reputation: 42099

You can do something like:

$line =~ tr/\n//d;

But really chomp should work:

while (<filehandle>){
   chomp;
   ...
}

Also, s/\n|\r// only replaces the first occurrence of \r or \n. If you want to replace all occurrences, you need the global modifier at the end: s/\r|\n//g.

Note: if you are including \r for Windows files, lines there usually end in \r\n, so you would want to remove both (e.g. s/(?:\r\n|\n)//). Of course, the statement above with the global modifier (s/\r|\n//g) would take care of that anyway.
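
A small sketch of the difference the global modifier makes, assuming a string with two CRLF-terminated lines; the variable names are made up for this example. It also shows tr, which needs the /d modifier to actually delete characters (without it, tr only counts them):

```perl
use strict;
use warnings;

my $raw = "one\r\ntwo\r\n";

(my $first_only = $raw) =~ s/\r|\n//;     # removes only the first "\r"
(my $all_gone   = $raw) =~ s/\r|\n//g;    # removes every "\r" and "\n"
(my $tr_deleted = $raw) =~ tr/\r\n//d;    # same result using tr with /d

printf "without /g: %d characters left\n", length $first_only;
printf "with /g:    %d characters left\n", length $all_gone;
printf "tr with /d: %d characters left\n", length $tr_deleted;
```

Only the versions with /g or tr///d leave the text free of line-ending characters.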

Upvotes: 4
