Reputation: 170

Removing specific character from anywhere between two specific strings?

I have a large text file that contains content as per the below example:

number="+123 123 123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456 789" text="Numbers here should keep their spaces"
number="+9 8 7 6 5" text="example 123 123 123"

What I would like is to remove any whitespace character between two identifying strings, in this case number= and " text= without touching the rest of the line. So that the desired output would be:

number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"

A regex like (?<=[0-9])(\s)(?=[0-9]) will also affect the text field, which is undesirable.

I have tested a few variations of using something along the lines of (?<=address)(\s)(?=date) but this doesn't work. I think the problem lies in not being able to deal with the extra possible numbers in between the whitespace and the markers?

Adding wildcard matches into the lookbehinds/lookaheads, such as (?<=address.*)(\s)(?=.*date), doesn't seem to be valid, or else I've done it wrong? Making the whitespace match lazy with (\s+?) doesn't seem to help either, but this is about where my knowledge of regex really falls to pieces :)

Ideally I would also like to include the extra equals and quote characters in the markers for safety, i.e., number=" as the beginning marker and text=" as the end marker.

Any sed/awk or similar solutions are also welcomed if easier.

Upvotes: 2

Answers (5)

mklement0

Reputation: 437753

Note: This complements the existing answers by comparing their performance.

Test environments:

OS X 10.9.4:
- FreeBSD awk 20070501
- FreeBSD sed (cannot tell version number)
- Perl v5.16.2

Ubuntu 14.04:
- GNU Awk 4.0.1
- sed (GNU sed) 4.2.2
- Perl v5.18.2

The short of it:

The awk solutions are fastest.
- On OS X, @jaypal's solution is faster; on Ubuntu, it's @mklement0's (mine).
The perl solution comes next.
The sed solution (accepted answer) is slowest.
- Note that removing the unnecessary g option does improve things measurably, but doesn't change the big picture.

On OS X, the differences aren't dramatic.
On Ubuntu, the differences between the awk and the perl solutions are small, but the sed solution is dramatically slower.

Sample numbers, running against a 100,000-line input file 10 times. Don't compare them directly (Ubuntu is running in a VM on the OS X machine), just look at their ratios. (Curiously, though, awk and perl ran faster in the Ubuntu VM):

OS X:

# awk (@jaypal)
real    0m3.848s
user    0m3.773s
sys     0m0.049s

# awk (@mklement0)
real    0m4.011s
user    0m3.959s
sys     0m0.045s

# perl
real    0m4.382s
user    0m4.291s
sys     0m0.063s

# sed
real    0m4.867s
user    0m4.816s
sys     0m0.044s

# sed (no `g`)
real    0m4.510s
user    0m4.460s
sys     0m0.044s

Ubuntu:

# awk (@mklement0)
real    0m1.850s
user    0m1.788s
sys     0m0.020s

# awk (@jaypal)
real    0m2.055s
user    0m1.996s
sys     0m0.012s

# perl
real    0m2.349s
user    0m2.276s
sys     0m0.024s

# sed
real    0m8.278s
user    0m8.196s
sys     0m0.016s

# sed (no `g`)
real    0m7.580s
user    0m7.488s
sys     0m0.028s
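
A timing run of this kind could be set up roughly as follows. This is only a sketch, not the script that produced the numbers above: the helper names make_input and run_n, the temporary file, and the repeated sample line are all assumptions, and the time keyword assumes a shell such as bash that can time a function call.

```shell
# Hypothetical helpers for illustration; not the original benchmark script.

# Write N copies of one sample line to FILE (N defaults to 100000).
make_input() {   # usage: make_input FILE [N]
  awk -v n="${2:-100000}" 'BEGIN {
    for (i = 0; i < n; i++)
      print "number=\"+123 123 123\" text=\"This is some text\""
  }' > "$1"
}

# Run a command COUNT times, discarding its standard output.
run_n() {        # usage: run_n COUNT CMD [ARGS...]
  n=$1; shift
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" > /dev/null
    i=$((i + 1))
  done
}

input=$(mktemp)               # throwaway input file
make_input "$input" 1000      # raise to 100000 to mirror the run described above
time run_n 10 awk -F ' text="' '{ gsub(/ /, "", $1); print $1 FS $2 }' "$input"
rm -f "$input"
```

Swapping the awk command for the perl or sed command gives the other rows; only the ratios between them are meaningful.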

Upvotes: 1

zx81

Reputation: 41838

Search: [ ](?=[^"]*" text=) (the [brackets] around the space are optional, they are there for clarity)

Replace: empty string.

In the regex demo, see the substitutions at the bottom.

Command-Line Syntax

I don't know the sed syntax to search and replace. With Perl (courtesy of @jaypal and @AvinashRaj):

perl -pe 's/ (?=[^"]*" text=)//g' file

From perl --help,

-p                assume loop like -n but print line also, like sed
-e program        one line of program (several -e's allowed, omit programfile)

Upvotes: 2

mklement0

Reputation: 437753

Another awk solution:

 awk -F ' text="' '{ gsub(/ /, "", $1); print $1 FS $2 }' file

-F ' text="' splits each input line into the part before ' text="' ($1) and the part after it ($2): the -F option sets the special awk variable FS (field separator) to a regex that awk uses to split each input line into fields.
gsub(/ /, "", $1) (global substitution) removes all spaces from $1 (the part before text=") by replacing each space with the empty string.
print $1 FS $2 prints the output: the modified $1 (spaces removed), joined with FS (i.e., ' text="'), joined with $2 (the unmodified part after text=").
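
One caveat: if the text value itself contains ' text="', the line splits into more than two fields, and print $1 FS $2 drops everything from the second separator on. A sketch of a variant that splits only at the first occurrence, using index() and substr(), is shown below; the function name strip_before_text is hypothetical.

```shell
# Hypothetical variant: split only at the first occurrence of the separator.
strip_before_text() {
  awk '{
    i = index($0, " text=\"")     # position of the first separator, 0 if absent
    if (i == 0) { print; next }   # no separator: pass the line through unchanged
    head = substr($0, 1, i - 1)   # everything before the separator
    tail = substr($0, i)          # the separator and everything after it
    gsub(/ /, "", head)           # remove spaces from the leading part only
    print head tail
  }' "$@"
}

strip_before_text <<'EOF'
number="+123 123 123" text="This is some text"
number="+1 2" text="a text="b"
EOF
```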

Upvotes: 1

jaypal singh

Reputation: 77105

Using awk:

awk 'BEGIN{FS=OFS="\""}{gsub(/ /,"",$2)}1' file
number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"
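
For readers unfamiliar with the idiom, the one-liner can be spelled out as below. This is only a commented restatement, and it assumes (as in the sample data) that the number value is the first double-quoted string on each line; the function name strip_second_field is made up.

```shell
# Hypothetical expanded form of the one-liner above.
strip_second_field() {
  awk '
    BEGIN { FS = OFS = "\"" }   # split on double quotes and rejoin with them
    {
      gsub(/ /, "", $2)         # $2 is the text between the first pair of quotes
      print                     # same effect as the trailing 1 in the one-liner
    }
  ' "$@"
}

strip_second_field <<'EOF'
number="+123456 789" text="Numbers here should keep their spaces"
EOF
```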

Upvotes: 4

perreal

Reputation: 97948

Using a substitution and a loop:

sed ':l s/\(number="[^" \t]*\)\s\s*/\1/g;tl' input

This gives:

number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"
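
The \s and \t escapes used in the command above are not part of POSIX, so it may behave differently across sed implementations. A sketch of the same loop written with POSIX character classes, and without the g flag (the t loop already repeats the substitution), might look like this; the function name strip_number_ws_sed is hypothetical.

```shell
# Hypothetical portable form of the loop above.
strip_number_ws_sed() {
  # :l    defines the label l
  # s///  removes the first run of whitespace inside the number value
  # t l   jumps back to l if the substitution changed anything
  sed -e ':l' \
      -e 's/\(number="[^"[:space:]]*\)[[:space:]][[:space:]]*/\1/' \
      -e 't l' "$@"
}

strip_number_ws_sed <<'EOF'
number="+9 8 7 6 5" text="example 123 123 123"
EOF
```

The loop stops once the character after the captured prefix is the closing quote, so the text field is left alone.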

Upvotes: 3
