Josh

Reputation: 170

Removing specific character from anywhere between two specific strings?

I have a large text file that contains content as per the below example:

number="+123 123 123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456 789" text="Numbers here should keep their spaces"
number="+9 8 7 6 5" text="example 123 123 123"

What I would like is to remove any whitespace character between two identifying strings, in this case number= and " text= without touching the rest of the line. So that the desired output would be:

number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"

A regex like (?<=[0-9])(\s)(?=[0-9]) will interfere with the text field, which is undesirable.

I have tested a few variations of using something along the lines of (?<=address)(\s)(?=date) but this doesn't work. I think the problem lies in not being able to deal with the extra possible numbers in between the whitespace and the markers?

Adding wildcard matches into the lookbehinds/lookaheads, such as (?<=address.*)(\s)(?=.*date), doesn't seem valid, or else I've done it wrong. Making the whitespace lazy with (\s+?) doesn't seem to help either, but this is about where my knowledge of regex really falls to pieces :)

Ideally I would also like to restrict the match to between the extra equals and quote characters for safety, i.e., number=" as the beginning marker and text=" as the end marker.

Any sed/awk or similar solutions are also welcomed if easier.

Upvotes: 2

Views: 170

Answers (5)

mklement0

Reputation: 437753

Note: This answer complements the existing ones by comparing their performance.

Test environments:

  • OS X 10.9.4.
    • FreeBSD awk 20070501
    • FreeBSD sed (cannot tell version number)
    • Perl v5.16.2
  • Ubuntu 14.04
    • GNU Awk 4.0.1
    • sed (GNU sed) 4.2.2
    • Perl v5.18.2

The short of it:

On OS X, the differences aren't dramatic.
On Ubuntu, the differences between the awk and the perl solutions are small, but the sed solution is dramatically slower.

Sample numbers, running against a 100,000-line input file 10 times. Don't compare them directly (Ubuntu is running in a VM on the OS X machine), just look at their ratios. (Curiously, though, awk and perl ran faster in the Ubuntu VM):

OS X:

# awk (@jaypal)
real    0m3.848s
user    0m3.773s
sys     0m0.049s

# awk (@mklement0)
real    0m4.011s
user    0m3.959s
sys     0m0.045s

# perl
real    0m4.382s
user    0m4.291s
sys     0m0.063s

# sed
real    0m4.867s
user    0m4.816s
sys     0m0.044s

# sed (no `g`)
real    0m4.510s
user    0m4.460s
sys     0m0.044s

Ubuntu:

# awk (@mklement0)
real    0m1.850s
user    0m1.788s
sys     0m0.020s

# awk (@jaypal)
real    0m2.055s
user    0m1.996s
sys     0m0.012s

# perl
real    0m2.349s
user    0m2.276s
sys     0m0.024s

# sed
real    0m8.278s
user    0m8.196s
sys     0m0.016s

# sed (no `g`)
real    0m7.580s
user    0m7.488s
sys     0m0.028s
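
The answer does not show the harness used for these timings. One way such a run could be set up is sketched below; the helper names make_input and run10 and the temporary file are assumptions made for illustration, not the author's actual setup.

```shell
# Hypothetical generator: cycle through the four sample lines until n lines are printed.
make_input() {
  awk -v n="$1" 'BEGIN {
    s[0] = "number=\"+123 123 123\" text=\"This is some text\""
    s[1] = "number=\"+123456\" text=\"This may contain numbers\""
    s[2] = "number=\"+123456 789\" text=\"Numbers here should keep their spaces\""
    s[3] = "number=\"+9 8 7 6 5\" text=\"example 123 123 123\""
    for (i = 0; i < n; i++) print s[i % 4]
  }'
}

# Hypothetical runner: invoke the given command 10 times on the file, discarding output.
run10() {
  file=$1
  shift
  i=0
  while [ "$i" -lt 10 ]; do
    "$@" "$file" > /dev/null || return 1
    i=$((i + 1))
  done
}

# Small smoke run on 8 lines; a real measurement would use 100000 lines.
tmp=$(mktemp)
make_input 8 > "$tmp"
run10 "$tmp" awk 'BEGIN{FS=OFS="\""}{gsub(/ /,"",$2)}1'
rm -f "$tmp"
```

In bash, prefixing the run10 call with the time keyword would print real/user/sys figures in the same shape as those above.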

Upvotes: 1

zx81

Reputation: 41838

Search: [ ](?=[^"]*" text=) (the [brackets] around the space are optional, they are there for clarity)

Replace: empty string.

In the regex demo, see the substitutions at the bottom.

Command-Line Syntax

I don't know the sed syntax to search and replace. With Perl (courtesy of @jaypal and @AvinashRaj):

perl -pe 's/ (?=[^"]*" text=)//g' file

From perl --help,

-p                assume loop like -n but print line also, like sed
-e program        one line of program (several -e's allowed, omit programfile)

Upvotes: 2

mklement0

Reputation: 437753

Another awk solution:

 awk -F ' text="' '{ gsub(/ /, "", $1); print $1 FS $2 }' file
  • -F ' text="' splits each input line into the part before text=" ($1) and the part after ($2). The -F option sets the special FS (*f*ield *s*eparator) awk variable, which awk treats as a regex for splitting each input line into fields.
  • gsub(/ /, "", $1) (*g*lobal *sub*stitution) removes all spaces from $1 (the part before text="; replaces spaces with the empty string).
  • print $1 FS $2 prints the output: the modified $1 (spaces removed), joined with FS (i.e., text="), joined with $2 (the unmodified part after text=").
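
Because the command prints only $1 and $2, anything after a second occurrence of text=" on the same line would be dropped. If that matters, a variant sketch (not the command above; the function name and the marker, head, and tail variables are illustrative) can cut the line at the first marker with index and substr instead of field splitting:

```shell
# Hypothetical variant: split at the first marker only, keep the rest verbatim.
strip_before_text() {
  awk '{
    marker = " text=\""
    pos = index($0, marker)          # position of the first marker, 0 if absent
    if (pos == 0) { print; next }    # leave lines without the marker untouched
    head = substr($0, 1, pos - 1)    # everything before the marker
    tail = substr($0, pos)           # the marker plus the rest of the line
    gsub(/ /, "", head)              # remove spaces from the leading part only
    print head tail
  }'
}

printf '%s\n' 'number="+123456 789" text="Numbers here should keep their spaces"' | strip_before_text
```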

Upvotes: 1

jaypal singh

Reputation: 77105

Using awk:

awk 'BEGIN{FS=OFS="\""}{gsub(/ /,"",$2)}1' file
number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"
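
With the double quote as both input and output field separator, $2 is the value between the first pair of quotes, so gsub touches only that value and the trailing 1 prints every line. As a sketch of the extra safety the question asks for, the substitution could be limited to lines whose first field is exactly number= (the function name strip_field2 is made up for illustration):

```shell
# Hypothetical guarded variant of the quote-splitting approach.
strip_field2() {
  awk 'BEGIN { FS = OFS = "\"" }
       $1 == "number=" { gsub(/ /, "", $2) }   # only touch lines starting with number="
       1'
}

printf '%s\n' 'number="+123 123 123" text="This is some text"' | strip_field2
```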

Upvotes: 4

perreal

Reputation: 97948

Using a substitution and a loop:

sed ':l s/\(number="[^" \t]*\)\s\s*/\1/g;tl' input

This gives:

number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"
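
How the loop works: :l defines a label, the s command captures number=" plus the run of characters that are neither quotes nor whitespace and deletes the whitespace that follows, and tl jumps back to the label whenever a substitution was made. Since \s is a GNU sed extension, a more portable spelling might look like the sketch below; it assumes POSIX character classes and separate -e expressions, and the function name is hypothetical.

```shell
# Hypothetical portable variant of the substitute-and-loop approach.
strip_with_sed() {
  # Each pass removes one run of whitespace inside the number value;
  # t branches back to label l until no further substitution succeeds.
  sed -e ':l' \
      -e 's/\(number="[^"[:space:]]*\)[[:space:]][[:space:]]*/\1/' \
      -e 'tl'
}

printf '%s\n' 'number="+9 8 7 6 5" text="example 123 123 123"' | strip_with_sed
```

The loop stops once the captured run is followed directly by the closing quote, so the text value is never reached.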

Upvotes: 3
