Reputation: 170
I have a large text file that contains content as per the below example:
number="+123 123 123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456 789" text="Numbers here should keep their spaces"
number="+9 8 7 6 5" text="example 123 123 123"
What I would like is to remove any whitespace character between two identifying strings, in this case number= and " text=, without touching the rest of the line, so that the desired output would be:
number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"
A regex like (?<=[0-9])(\s)(?=[0-9]) will interfere with the text field, which is undesirable.
I have tested a few variations along the lines of (?<=address)(\s)(?=date), but this doesn't work. I think the problem lies in not being able to deal with the extra possible numbers between the whitespace and the markers?
Adding wildcard matches to the lookbehinds/lookaheads, such as (?<=address.*)(\s)(?=.*date), doesn't seem valid, or else I've done it wrong? Making the whitespace lazy with (/s+?) doesn't seem to help either, but this is about where my knowledge of regex really falls to pieces :)
Ideally I would also like to restrict the match to the extra equals and quote characters for safety, i.e. number=" as the beginning marker and text=" as the end marker.
Any sed/awk or similar solutions are also welcomed if easier.
Upvotes: 2
Views: 170
Reputation: 437753
Note: This is a complement to the existing answers to compare their performance.
Test environments:
The short of it:
The awk solutions are fastest.
Followed by the perl solution.
The sed solution (accepted answer) is slowest. Dropping the g option does improve things measurably, but doesn't change the big picture.
On OS X, the differences aren't dramatic.
On Ubuntu, the differences between the awk and the perl solutions are small, but the sed solution is dramatically slower.
Sample numbers, running against a 100,000-line input file 10 times.
Don't compare them directly (Ubuntu is running in a VM on the OS X machine); just look at their ratios. (Curiously, though, awk and perl ran faster in the Ubuntu VM):
OS X:
# awk (@jaypal)     real 0m3.848s  user 0m3.773s  sys 0m0.049s
# awk (@mklement0)  real 0m4.011s  user 0m3.959s  sys 0m0.045s
# perl              real 0m4.382s  user 0m4.291s  sys 0m0.063s
# sed               real 0m4.867s  user 0m4.816s  sys 0m0.044s
# sed (no `g`)      real 0m4.510s  user 0m4.460s  sys 0m0.044s
Ubuntu:
# awk (@mklement0)  real 0m1.850s  user 0m1.788s  sys 0m0.020s
# awk (@jaypal)     real 0m2.055s  user 0m1.996s  sys 0m0.012s
# perl              real 0m2.349s  user 0m2.276s  sys 0m0.024s
# sed               real 0m8.278s  user 0m8.196s  sys 0m0.016s
# sed (no `g`)      real 0m7.580s  user 0m7.488s  sys 0m0.028s
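The harness behind these numbers is not shown, so the following is only a rough sketch of how a comparable measurement could be set up. The helper names make_input and bench, and the shape of the generated lines, are assumptions made for illustration.

```shell
# Hypothetical helpers; not the harness actually used for the numbers above.

# make_input LINES FILE
# Writes LINES lines shaped like the question's sample data to FILE.
make_input() {
  awk -v n="$1" 'BEGIN {
    for (i = 1; i <= n; i++)
      printf "number=\"+123 456 %d\" text=\"some text %d %d\"\n", i, i, i
  }' > "$2"
}

# bench RUNS FILE COMMAND [ARGS...]
# Runs COMMAND with FILE appended as its last argument RUNS times,
# discarding the output so that only processing time is measured.
bench() {
  runs=$1 file=$2
  shift 2
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" "$file" > /dev/null
    i=$((i + 1))
  done
}
```

With these defined, something like `make_input 100000 input.txt` followed by `time bench 10 input.txt awk 'BEGIN{FS=OFS="\""}{gsub(/ /,"",$2)}1'` (in bash, where time is a keyword) would time one candidate. Repeat with each of the other commands and compare the ratios.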
Upvotes: 1
Reputation: 41838
Search: [ ](?=[^"]*" text=)
(The [brackets] around the space are optional; they are there for clarity.)
Replace: empty string.
In the regex demo, see the substitutions at the bottom.
Command-Line Syntax
I don't know the sed syntax to search and replace. With Perl (courtesy of @jaypal and @AvinashRaj):
perl -pe 's/ (?=[^"]*" text=)//g' file
From perl --help:
-p assume loop like -n but print line also, like sed
-e program one line of program (several -e's allowed, omit programfile)
Upvotes: 2
Reputation: 437753
Another awk solution:
awk -F ' text="' '{ gsub(/ /, "", $1); print $1 FS $2 }' file
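To make the splitting concrete, this small sketch prints the two fields that awk sees for one sample line when the same -F value is used. The function name show_fields is made up for illustration.

```shell
# Print the two fields produced by splitting on the string ' text="'.
show_fields() {
  awk -F ' text="' '{ printf "$1 = [%s]\n$2 = [%s]\n", $1, $2 }'
}

printf '%s\n' 'number="+123 123 123" text="This is some text"' | show_fields
```

The first field holds everything up to the closing quote of the number, which is why removing spaces from $1 alone leaves the text field untouched.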
-F ' text="' splits each input line into the part before text=" ($1) and the part after ($2). The -F option sets the special FS (*f*ield *s*eparator) awk variable to a regex that awk uses to split each input line into fields.
gsub(/ /, "", $1) (*g*lobal *sub*stitution) removes all spaces from $1, the part before text=", by replacing them with the empty string.
print $1 FS $2 prints the output: the modified $1 (spaces removed), joined with FS (i.e., text="), joined with $2 (the unmodified part after text=").
Upvotes: 1
Reputation: 77105
Using awk:
awk 'BEGIN{FS=OFS="\""}{gsub(/ /,"",$2)}1' file
number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"
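For readers less used to awk one-liners, here is the same command spelled out with comments, as a sketch. The wrapper function name strip_number_spaces_awk is made up for illustration.

```shell
strip_number_spaces_awk() {
  awk '
    # Split on double quotes, and rejoin on them when a field is modified.
    BEGIN { FS = OFS = "\"" }
    # $1 is number= and $2 is the value inside the first pair of quotes.
    { gsub(/ /, "", $2) }
    # Print the (possibly rebuilt) record; this is what the trailing 1 does.
    { print }
  ' "$@"
}

printf '%s\n' 'number="+9 8 7 6 5" text="example 123 123 123"' |
  strip_number_spaces_awk
```

This relies on the number always being the first quoted value on the line, as it is in the sample data.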
Upvotes: 4
Reputation: 97948
Using a substitution and a loop:
sed ':l s/\(number="[^" \t]*\)\s\s*/\1/g;tl' input
This gives:
number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"
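The command above uses \s and a one-line label, both of which are GNU sed extensions. Where GNU sed is not available, a variant along the following lines may work instead; it is a sketch that swaps in POSIX character classes and separate -e expressions, and the function name strip_number_spaces_sed is made up for illustration.

```shell
strip_number_spaces_sed() {
  # :l  defines a label.
  # s   removes the first run of whitespace inside the quoted number.
  # tl  jumps back to the label whenever the substitution succeeded,
  #     so the loop ends once no whitespace is left before the closing quote.
  sed -e ':l' \
      -e 's/\(number="[^"[:space:]]*\)[[:space:]]\{1,\}/\1/' \
      -e 'tl' "$@"
}

printf '%s\n' 'number="+123 123 123" text="This is some text"' |
  strip_number_spaces_sed
```

Each pass removes one run of whitespace, so a number with several gaps takes several passes. That repetition is one plausible reason the sed approach comes out slowest in the timings elsewhere in this thread.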
Upvotes: 3