Steve P.

Reputation: 14709

Confused about single quotes/double quotes and \\ with respect to split

So, I saw in another post that to split using \\ as a delimiter, you need to split on \\\\\\\\. This didn't really make sense to me, but when I attempted to split using \\\\, this happened:

my $string="a\\\\b\\\\c";
my @ra=split("\\\\",$string);

Array is:

a
<empty>    
b
<empty>
c

As the other poster said, using \\\\\\\\ works perfectly. Why is this the case?

Also, I got curious and started messing with '' vs "" and got unexpected results. I thought that I understood what the difference is, but I guess I didn't, at least not in the following context:

my $string="a\.\.b\.\.c";
my @ra=split("\.\.",$string);

Array is:

<empty>
<empty>
<empty>
c

Yet,

my $string="a\.\.b\.\.c";
my @ra=split('\.\.',$string);

Array is:

a
b
c

Thanks in advance.

Upvotes: 3

Views: 996

Answers (3)

OneSolitaryNoob

Reputation: 5767

Split using the regex literal /\\\\/ instead of a quoted string and you avoid the double-escaping worries, e.g.

use Data::Dumper;

my $string= "a\\\\b\\\\c";

my @ra = split /\\\\/, $string;

print Dumper \@ra;   # pass a reference so the array is dumped as one structure

will output

$VAR1 = [
          'a',
          'b',
          'c'
        ];

/\\\\/ matches two \ in a row (a single /\\/ matches just one \)

or you can be cute and do

split /\\{2}/, $string

Upvotes: 0

ikegami

Reputation: 386706

  • In single-quoted string literals,

    • \ followed by the string delimiter (' by default) results in the string delimiter.

      'That\'s fool\'s gold!'   -> That's fool's gold!
      q!That's fool's gold\!!   -> That's fool's gold!
      
    • \ followed by \ results in \.

      'c:\\foo'                 -> c:\foo
      
    • \ followed by anything else results in those two characters.

      'c:\foo'                  -> c:\foo
      
  • In double-quoted string literals,

    • \ followed by a non-word character results in that character.

      "c:\\foo"                 -> c:\foo
      "Can't open \"foo\""      -> Can't open "foo"
      
    • \ followed by a word character has a special meaning.

      "foo\n"                   -> foo{newline}
      
  • In regular expression literals,

    • \ followed by the delimiter results in the delimiter.

      qr/\//                    -> /
      
    • \ followed by anything else results in those two characters.

      qr/\\/                    -> \\
      qr/\_/                    -> \_
      qr/\$/                    -> \$
      qr/\n/                    -> \n
      
  • When applying a regular expression,

    • \ followed by a non-word character matches that character.

      /c:\\foo/                 -> Matches strings containing: c:\foo
      
    • \ followed by a word character has a special meaning.

      /foo\z/                   -> Matches strings ending with: foo
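
As a rough check of the rules above, here is a minimal sketch that builds the same literals and prints their lengths. The variable names are made up for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

# Single-quoted: only \\ and \' are escapes; any other backslash is kept.
my $single_escaped = 'c:\\foo';    # c:\foo
my $single_plain   = 'c:\foo';     # c:\foo (the same six characters)

# Double-quoted: \\ becomes one backslash, \n becomes a newline.
my $double_escaped = "c:\\foo";    # c:\foo
my $with_newline   = "foo\n";      # foo plus one newline character

# Regex literal: the pattern text is c:\\foo, and the regex engine
# reads \\ as one literal backslash.
my $regex_matches = $double_escaped =~ /c:\\foo/ ? 1 : 0;

print length($single_escaped), " ", length($single_plain), " ",
      length($double_escaped), " ", length($with_newline), "\n";
print "regex matched: $regex_matches\n";
```

The three c:\foo spellings all come out as the same six-character string, which is why the single-quoted form hides whether a backslash was escaped or not.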
      

Looking at your cases:

 my $string="a\\\\b\\\\c";
 my @ra=split("\\\\",$string);

"\\\\" results in the string \\, so you first create the string a\\b\\c and you pass \\ to split.

The first argument of split is used as a regular expression, and the regex pattern \\ matches a single \. There are 4 \ in a\\b\\c, so it gets split into 4+1 pieces.

If you use regex literals instead of double-quoted string literals, there will be less confusion.

split(/\\/, $string);        # Passes pattern \\ to split. Matches singles
split("\\\\", $string);      # Passes pattern \\ to split. Matches singles
split(/\\\\/, $string);      # Passes pattern \\\\ to split. Matches doubles
split("\\\\\\\\", $string);  # Passes pattern \\\\ to split. Matches doubles

In short, don't use split "..."!
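
To see the four calls side by side, a small sketch along these lines should do (the array names are illustrative only); it counts the fields each call returns for the string from the question.

```perl
use strict;
use warnings;

my $string = "a\\\\b\\\\c";   # the string a\\b\\c

# Pattern \\ matches one backslash: 4 separators, 5 fields (two of them empty).
my @singles_regex  = split(/\\/,   $string);
my @singles_string = split("\\\\", $string);

# Pattern \\\\ matches two backslashes: 2 separators, 3 fields.
my @doubles_regex  = split(/\\\\/,     $string);
my @doubles_string = split("\\\\\\\\", $string);

print scalar(@singles_regex), " ", scalar(@singles_string), "\n";   # 5 5
print scalar(@doubles_regex), " ", scalar(@doubles_string), "\n";   # 3 3
```

The string forms need twice as many backslashes as the regex literals to reach the same pattern, which is the whole source of the confusion.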


Your other two cases should be obvious to you by now.

my $string="a\.\.b\.\.c";          # String a..b..c
my @ra=split("\.\.",$string);      # Pattern .., which matches any two chars.

my $string="a\.\.b\.\.c";          # String a..b..c
my @ra=split('\.\.',$string);      # Pattern \.\., which matches two periods.
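
The same two cases as a runnable sketch (array names are illustrative), printing each field in brackets so the empty ones are visible:

```perl
use strict;
use warnings;

my $string = "a\.\.b\.\.c";   # the string a..b..c

# Double quotes drop the backslashes, so the pattern is .. (any two characters).
my @any_two = split("\.\.", $string);

# Single quotes keep them, so the pattern is \.\. (two literal periods).
my @two_dots = split('\.\.', $string);

print scalar(@any_two),  " fields: [", join('][', @any_two),  "]\n";   # 4 fields: [][][][c]
print scalar(@two_dots), " fields: [", join('][', @two_dots), "]\n";   # 3 fields: [a][b][c]
```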

Upvotes: 3

amon

Reputation: 57656

Oh, quoting rules and regexes.

Backslash rules with different quotes

  • In q() and related, all backslashes are left in the string, unless they escape the string delimiter or another backslash:

    say '\a\\b\''; # »\a\b'«
    
  • In qq() and related, all backslashes that do not form a known string escape sequence are silently removed:

    say "\d\\b\"\."; # »d\b".«
    
  • Ditto in qr// and regex literals, except that there are different escapes compared to double quoted strings.
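
A minimal sketch of the first two rules, using only non-word escapes in the double-quoted string so that no "Unrecognized escape" warning is involved. The variable names are illustrative, not from the answer.

```perl
use strict;
use warnings;

my $single = '\a\\b\'';   # q-style:  \a is kept, \\ -> \, \' -> '
my $double = "\.\\b\"";   # qq-style: \. -> ., \\ -> \, \" -> "

print "$single\n";   # \a\b'
print "$double\n";   # .\b"
```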

If a string is used in place of a regex, then during compilation the escape rules for that kind of string are performed. However, a second level of escapes is processed when it is used as a regex, hence backslashes have to be double-escaped in the worst cases. Regex literals don't suffer from this problem; there is only one level of escaping.

Explanations for your examples

Therefore, "a\\\\b\\\\c" is a\\b\\c, and "\\\\" is \\ which matches \ as a regex. So it splits on every backslash, thus producing zero-length fields in between the double backslashes.

The '\\\\\\\\' from the other question you mentioned is \\\\, which as a regex matches \\.

The "a\.\.b\.\.c" is a..b..c, and "\.\." is .. which as a regex matches any two non-newline characters. It first matches a., then .b, then the remaining pair of periods (..), leaving only c. This produces the string fragments "", "", "", "c".

The string '\.\.' is \.\., which as a regex matches two literal periods in sequence.

The solution is to use regexes where regexes are due. split takes a regex as its first argument, as in split /foo/; in other scenarios the regex quote qr/foo/ is useful. This avoids mind-bending[1] double escaping.

[1]: for small values of "mind-bending", once you grok the rules.
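
For the qr// case, something like the following sketch may help; $separator and the other names are hypothetical and only there for illustration.

```perl
use strict;
use warnings;

# $separator is a hypothetical name. qr// builds a regex object and
# needs only the regex level of escaping.
my $separator = qr/\\\\/;       # matches two backslashes in a row

my $string = 'a\\\\b\\\\c';     # single quotes: the string a\\b\\c
my @fields = split $separator, $string;

# The compiled regex can also be interpolated into a larger pattern.
my $starts_with_a_sep = $string =~ /^a$separator/ ? 1 : 0;

print join(',', @fields), "\n";   # a,b,c
print "$starts_with_a_sep\n";     # 1
```

Keeping the pattern in a qr// variable means it can be passed around or reused without ever going through string-literal escaping.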

Upvotes: 4
