Scoox

Reputation: 11

Perl Regular expression remove double tabs, line breaks, white spaces

I want to write a Perl script that removes double tabs, line breaks and white spaces.

What I have so far is:

$txt=~s/\r//gs;
$txt=~s/ +/ /gs;
$txt=~s/\t+/\t/gs;
$txt=~s/[\t\n]*\n/\n/gs;
$txt=~s/\n+/\n/gs;

But: 1. It's not beautiful. It should be possible to do this with far fewer regexes. 2. It just doesn't work, and I really don't know why. It leaves some double tabs, white spaces and empty lines (i.e. lines with only a tab or whitespace).

I could solve it with a while loop, but that is very slow and ugly.

Any suggestions?

Upvotes: 1

Views: 24822

Answers (4)

Joel Berger

Reputation: 20280

While I work out a proper answer for you: have you looked at the docs? (And no, I'm not just saying RTFM.) perldoc is a great tool with plenty of useful information; I suggest perldoc perlrequick and perldoc perlreref to get you going.

First of all, you might find it easier to split the long text into lines and operate on the lines separately and then join them again. Also if we make a new array to store the results to be joined we can easily exclude empty lines.

Finally, it strikes me that a long block of text like this is likely to come from outside your script. If you are really opening a file and slurping it into a variable, the approach I have left in as a comment block is easier. To use it, comment out the first block and uncomment the second; the third block is shared by both methods. I include this because if you really are reading in a file and then splitting it, reading it line by line saves a lot of work. You could then write the result out to another file if desired.

#!/usr/bin/env perl

use strict;
use warnings;

my @return_lines;

### Begin "text in script" Method ###
my $txt = <<END;
hello  world  

 hello world  
hello\t   world
hello\t\t world
END
#note: the last two lines test removing spaces after tabs (the \t escapes are interpolated by the heredoc)

my @lines = split(/\n/, $txt);
foreach my $line (@lines) {

### Begin "text in external file" Method (commented) ###
#my $filename = 'file.txt';
#open( my $filehandle, '<', $filename) or die "Cannot open $filename: $!";
#while (<$filehandle>) {
#  my $line = $_; 

### Script continues for either input method ###
  $line =~ s/^\s*//; #remove leading whitespace
  $line =~ s/\s*$//; #remove trailing whitespace
  $line =~ s/\ {2,}/ /g; #remove multiple literal spaces
  $line =~ s/\t{2,}/\t/g; #remove excess tabs (is this what you meant?)
  $line =~ s/(?<=\t)\ *//g; #remove any spaces after a tab 
  push @return_lines, $line unless $line=~/^\s*$/; #remove empty lines
}
my $return_txt = join("\n", @return_lines) . "\n";

print $return_txt;
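
The line-by-line idea also works directly on filehandles, so the cleaned text can be written straight to another file. Below is a minimal sketch of that, not a drop-in replacement for the script above: the sub name clean_stream and the sample input are made up for illustration, and in-memory filehandles stand in for real files so the example is self-contained.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative): read lines from $in,
# clean each one, and print the non-empty results to $out.
sub clean_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        $line =~ s/^\s+//;         # strip leading whitespace
        $line =~ s/\s+$//;         # strip trailing whitespace, including the newline
        $line =~ s/ {2,}/ /g;      # collapse runs of spaces
        $line =~ s/\t{2,}/\t/g;    # collapse runs of tabs
        $line =~ s/(?<=\t) +//g;   # drop spaces that follow a tab
        next if $line eq '';       # skip lines that are now empty
        print {$out} "$line\n";
    }
    return;
}

# Demo with in-memory filehandles; real code would open files instead.
my $input  = "hello  world  \n\n \thello world\t\nhello\t\t  world\n";
my $output = '';
open(my $in,  '<', \$input)  or die "Cannot read from string: $!";
open(my $out, '>', \$output) or die "Cannot write to string: $!";
clean_stream($in, $out);
close $out;
close $in;

print $output;
```

Passing filehandles rather than filenames keeps the cleaning logic independent of where the text comes from, which is why the same sub can be fed a file, STDIN or a string.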

Upvotes: 2

DVK

Reputation: 129403

This is a bit unclear.

If you have a line like ab TABcTABTAB \n\n, what do you want as a result? I am reading the above as ab c\n.

In other words, is it correct that you want:

  1. All the whitespace (e.g. any amount of spaces and tabs) in the middle of the lines converted to a single space?

  2. All the whitespace at the beginning OR end of the line removed (except for newlines)?

  3. Remove completely empty lines?

    $s =~ s/[\t ]+$//gms; # Remove ending spaces/tabs on every line
    $s =~ s/^[\t ]+//gms; # Remove starting spaces/tabs on every line
    $s =~ s/[\t ]+/ /gms; # Replace each run of whitespace mid-line with 1 space
    $s =~ s/^\n//gms;     # Remove completely empty lines
    

Please note that I used the "/ms" modifiers (read perldoc perlre for details); "/m" is what lets the ^ and $ anchors match at the start and end of each line within a multi-line string.
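
One way to check the three rules against the sample line from the question above is to wrap them in a small sub and run it. This is only a sketch under my reading of the requirements: the sub name normalize_whitespace is made up, and each substitution carries /g so that it applies to every line of the string rather than just the first match.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the three rules listed above.
sub normalize_whitespace {
    my ($s) = @_;
    $s =~ s/[\t ]+$//mg;   # rule 2: trailing spaces/tabs on every line
    $s =~ s/^[\t ]+//mg;   # rule 2: leading spaces/tabs on every line
    $s =~ s/[\t ]+/ /g;    # rule 1: inner runs become one space
    $s =~ s/^\n//mg;       # rule 3: drop lines that are now empty
    return $s;
}

my $sample = "ab \tc\t\t \n\n";
my $result = normalize_whitespace($sample);

(my $shown = $result) =~ s/\n/\\n/g;   # make newlines visible
print "$shown\n";                      # prints: ab c\n
```

The order matters: stripping the line ends first means whitespace-only lines are already empty by the time the last substitution looks for them.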

Upvotes: 1

Michael Kohne

Reputation: 12044

You've got a bit of a mish-mash of stuff in there, not all of which corresponds to what you said. Let's break down what you have and then perhaps you can work from there to what you want.

$txt=~s/\r//s; # removes only the first \r in the string. Did you mean to use g on this one?
$txt=~s/[\t ]\n//s; # match a single \t OR space right before a \n, and remove both it and the \n (first match only)
$txt=~s/ +/ /gs;# replace each run of spaces with a single space
$txt=~s/\t+/ /gs;# replace each run of tabs with a single space
$txt=~s/\n /\n/s;# remove a space immediately following a \n (first match only)
$txt=~s/\t /\t/s;# remove a space immediately following a \t (first match only)
$txt=~s/\n+/ /gs;# replace each run of \n with a single space, joining all the lines

I have the feeling that's not at all what you want to accomplish.

I'm honestly unclear on what you want to do. The way I read your stated intent, I would have thought you'd want to replace all double tabs with single tabs, all double line breaks with single line breaks, and all double spaces with single spaces. I'll further surmise that you actually want to handle runs of those characters, not just doubles. Here are the regexes for what I've just said (I've also removed all \r); hopefully that will give you something to go on:

$txt=~s/\r//gs;# remove all \r
$txt=~s/\t+/\t/gs;# replace all runs of > 1 tab with a single tab
$txt=~s/\n+/\n/gs;# replace all runs of > 1 \n with a single \n
$txt=~s/ +/ /gs;# replace all runs of > 1 space with a single space

Given that your attempted regexes don't seem to match the way I read your stated desire, I suspect that there's some fuzziness about what you really want to do here. You might want to think further about what you are trying to accomplish, which should help the regexes become clearer.
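
To make the gap between the stated goal and the regexes concrete, here is a small sketch that runs the four substitutions on a sample string. The sub names, the sample text and the second variant are my own assumptions, not part of the question: the variant guesses that lines containing only spaces or tabs should count as empty.

```perl
use strict;
use warnings;

# The four substitutions from above, wrapped in a sub (name is illustrative).
sub collapse_runs {
    my ($txt) = @_;
    $txt =~ s/\r//g;       # remove all \r
    $txt =~ s/\t+/\t/g;    # each run of tabs -> one tab
    $txt =~ s/\n+/\n/g;    # each run of newlines -> one newline
    $txt =~ s/ +/ /g;      # each run of spaces -> one space
    return $txt;
}

# Assumed variant: strip blanks at line ends first, so that
# whitespace-only lines become empty and get merged away.
sub collapse_runs_strict {
    my ($txt) = @_;
    $txt =~ s/\r//g;
    $txt =~ s/[ \t]+$//mg;
    return collapse_runs($txt);
}

# Render tabs and newlines as visible escapes for printing.
sub show {
    my ($s) = @_;
    $s =~ s/\t/\\t/g;
    $s =~ s/\n/\\n/g;
    return $s;
}

my $sample = "a\t\tb  c\r\n\n\n \nd\n";
print show(collapse_runs($sample)), "\n";         # a\tb c\n \nd\n
print show(collapse_runs_strict($sample)), "\n";  # a\tb c\nd\n
```

In the first result the line holding a single space survives, because \n+ only merges newlines that are directly adjacent. That matches the "lines with only a tab or whitespace" symptom described in the question.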

Upvotes: 3

justintime

Reputation: 3631

I am not sure of your exact requirements, but here are a few hints that might get you going:

To compress all white space to single spaces (probably too powerful, since \s also matches newlines!)

$txt=~s/\s+/ /g ;

To remove any spaces or tabs at the start of each line

$txt=~s/^[ \t]+//gm ;

To compress each run of tabs to a single space

$txt=~s/\t+/ /g ;
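
The difference between the first hint and a per-line approach is easy to miss, so here is a short sketch comparing the two. The sub names and the sample string are made up for illustration; the second sub swaps \s for the class [ \t] so that newlines are left alone.

```perl
use strict;
use warnings;

# Collapse every run of whitespace, newlines included.
sub squeeze_all {
    my ($txt) = @_;
    $txt =~ s/\s+/ /g;
    return $txt;
}

# Collapse only spaces and tabs, keeping the line structure.
sub squeeze_per_line {
    my ($txt) = @_;
    $txt =~ s/[ \t]+/ /g;   # runs of spaces/tabs -> one space
    $txt =~ s/^ //mg;       # drop the single space left at a line start
    return $txt;
}

my $txt = "one \t two\n\t three   four\n";

print squeeze_all($txt), "|\n";    # one two three four |
print squeeze_per_line($txt);      # two lines: "one two" and "three four"
```

The first version flattens the whole text onto one line, which is rarely what is wanted when empty lines are supposed to be removed rather than joined.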

Upvotes: 2
