royskatt

Reputation: 1210

Substitute first character before match

For each line, I need to insert a semicolon immediately before the first alphanumeric character that follows the first semicolon on that line.

Example:

Input:

00000001;Root;;
00000002;  Documents;;
00000003;    oracle-advanced_plsql.zip;file;
00000004;  Public;;
00000005;  backup;;
00000006;    20110323-JM-F.7z.001;file;
00000007;    20110426-JM-F.7z.001;file;
00000008;    20110603-JM-F.7z.001;file;
00000009;    20110701-JM-F-via-summer_school;;
00000010;      20110701-JM-F-via-summer_school.7z.001;file;

Desired output:

00000001;;Root;;
00000002;  ;Documents;;
00000003;    ;oracle-advanced_plsql.zip;file;
00000004;  ;Public;;
00000005;  ;backup;;
00000006;    ;20110323-JM-F.7z.001;file;
00000007;    ;20110426-JM-F.7z.001;file;
00000008;    ;20110603-JM-F.7z.001;file;
00000009;    ;20110701-JM-F-via-summer_school;;
00000010;      ;20110701-JM-F-via-summer_school.7z.001;file;

Could someone please help me write a Perl regex for that? I need it inside a program, not as a one-liner.

Upvotes: 2

Views: 2272

Answers (3)

royskatt

Reputation: 1210

First of all, thank you for your really great answers!

My actual code snippet looks like this:

 our $seperator=";"; # at the beginning of the file
 #...
 sub insert {
    my ( $seperator, $line, @all_lines, $count, @all_out );
    $count     = 0;
    @all_lines = read_file($filename);

    foreach $line (@all_lines) {
        $count = sprintf( "%08d", $count );
        chomp $line;
        $line =~ s/\:/$seperator/;                          # works
        $line =~ s/\ file/file/;                            # works

        #$line=~s/;\s*\K(?=\S)/;/;                          # doesn't work
        $line =~ s/^(.*?$seperator.*?)(\w)/$1$seperator$2/; # doesn't work
        say $count . $seperator . $line . $seperator; 

        $count++; # btw, is there maybe a hidden index variable in a foreach loop I could use instead of a new variable?
        push( @all_out, $count . $seperator . $line . $seperator . "\n" );
    }

    write_file( $csvfile, @all_out ); # using File::Slurp
}

To get the input I showed you, I had already made a few small substitutions, as you can see at the beginning of the foreach loop.

I am curious why the regular expressions presented by TLP and Yaakov do not work in my code. They do work in general, but only when written as in the example Yaakov gave:

while(<>) {                                                           
  s/^(.*?;.*?)(\w)/$1;$2/;                                            
  print $_;                                                           
}      
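
Judging only from the snippet above, one possible cause is variable scope: `my ( $seperator, ... )` inside `insert` declares a new, undefined lexical that hides the `our $seperator` set at the top of the file, so the pattern is compiled with an empty string where the semicolon should be. The sketch below reproduces that effect in isolation. The sub names `shadowed` and `not_shadowed` are made up for the illustration, and the real program may differ.

```perl
use strict;
use warnings;

our $seperator = ';';    # spelling kept from the snippet above

# Mimics the sub in the snippet: the inner "my" hides the package variable.
sub shadowed {
    my ($line) = @_;
    my $seperator;                      # undefined inside this sub
    no warnings 'uninitialized';        # silence the warnings this causes
    # With an empty separator the pattern is effectively /^(.*?.*?)(\w)/,
    # and the replacement puts back exactly what was matched.
    $line =~ s/^(.*?$seperator.*?)(\w)/$1$seperator$2/;
    return $line;
}

# Same substitution, but using the file-level variable.
sub not_shadowed {
    my ($line) = @_;
    $line =~ s/^(.*?\Q$seperator\E.*?)(\w)/$1$seperator$2/;
    return $line;
}

my $sample = '00000002;  Documents;;';
print shadowed($sample),     "\n";     # line comes back unchanged
print not_shadowed($sample), "\n";     # semicolon inserted before "Documents"
```

Note also that in the snippet the counter prefix is only prepended when the line is printed, so at the time the substitution runs the line may not yet contain the leading number field and its semicolon.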

Upvotes: 0

Yaakov Belch

Reputation: 4863

First of all, here is a program that seems to match your requirements:

#!/usr/bin/perl -w
while(<>) {                                                           
  s/^(.*?;.*?)(\w)/$1;$2/;                                            
  print $_;                                                           
}                                                                     

Store it in a file 'program.pl', make it executable with 'chmod u+x program.pl', and run it on your input data like this:

./program.pl input-data.txt

Here is an explanation of the regular expression:

s/        # start search-and-replace regexp
  ^       # start at the beginning of this line
  (       # save the matched characters until ')' in $1
    .*?;  # go forward until finding the first semicolon
    .*?   # go forward until finding... (to be continued below)
  )
  (       # save the matched characters until ')' in $2
    \w    # ... the next alphanumeric character.
  )
/         # continue with the replace part
  $1;$2   # write all characters found above, but insert a ; before $2
/         # finish the search-and-replace regexp.
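
The annotated form above is meant for reading and will not compile as written. A runnable equivalent can be sketched with the `/x` modifier, which lets whitespace and comments appear inside the pattern (but not inside the replacement). The sample line is taken from the question; the variable name `$line` is just for the illustration.

```perl
use strict;
use warnings;

my $line = '00000003;    oracle-advanced_plsql.zip;file;';

$line =~ s{
    ^          # start at the beginning of the line
    (          # first capture group:
      .*? ;    #   everything up to and including the first semicolon
      .*?      #   then as little as possible before the next group
    )
    ( \w )     # second capture group: the first word character
}{$1;$2}x;

print "$line\n";
```

Comments inside an `/x` pattern are still subject to variable interpolation, so it is safer to keep `$` and `@` out of them.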

Based on your sample input, I would use a more specific regular expression:

s/^(\d*; *)(\w)/$1;$2/;

This expression starts at the beginning of the line and skips over the leading digits (\d*), the first semicolon, and any spaces that follow. It then inserts a semicolon before the next word character.

Take whichever fits your needs best!
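
Since the question asks for something usable inside a program rather than a one-liner, the substitution could be wrapped in a subroutine. This is only a sketch: the name `add_field_separator` and the optional separator argument are assumptions for the example, not part of the original code. `\Q...\E` is there so the separator is matched literally even if it is a regex metacharacter such as `|`.

```perl
use strict;
use warnings;

# Hypothetical helper: inserts $sep before the first word character
# that follows the first occurrence of $sep in $line.
sub add_field_separator {
    my ($line, $sep) = @_;
    $sep = ';' unless defined $sep;
    $line =~ s/^(.*?\Q$sep\E.*?)(\w)/$1$sep$2/;
    return $line;
}

my @input = (
    '00000001;Root;;',
    '00000002;  Documents;;',
    '00000006;    20110323-JM-F.7z.001;file;',
);

print add_field_separator($_), "\n" for @input;
```

Returning the modified copy instead of editing `$_` in place keeps the helper easy to call from an existing loop over lines.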

Upvotes: 1

TLP

Reputation: 67890

This is a way to insert a semi-colon after the first semi-colon and whitespace, but before the first non-whitespace.

s/;\s*\K(?=\S)/;/

If you feel the need, you can use \w instead of \S, but I felt with this input it was an unnecessary specification.

The \K (keep) escape is similar to a lookbehind assertion in that it does not remove what it matches. The same goes for the lookahead assertion, so all this substitution does is insert a semi-colon in the designated spot.
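
For use inside a program, the same substitution might be wrapped like this. The sub name `insert_after_indent` is hypothetical, and the sample lines come from the question. Note that `\K` requires Perl 5.10 or later.

```perl
use 5.010;    # \K is available from Perl 5.10 on
use strict;
use warnings;

# Hypothetical helper: inserts a semicolon after the first semicolon and
# any whitespace that follows it, directly before the next visible character.
sub insert_after_indent {
    my ($line) = @_;
    # ";\s*" is matched and then dropped from the match by \K, so it is kept
    # in the string; (?=\S) only checks the next character without consuming it.
    $line =~ s/;\s*\K(?=\S)/;/;
    return $line;
}

my @input = (
    '00000001;Root;;',
    '00000010;      20110701-JM-F-via-summer_school.7z.001;file;',
);

print insert_after_indent($_), "\n" for @input;
```

Because there is no `/g` modifier, only the first position that satisfies the pattern is changed.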

Upvotes: 3
