Reputation: 8192
I would like to output the number of repeats of a pattern with regex. For example, convert "aaad" to "3xad" and "bCCCCC" to "b5xC". I want to do this in sed or awk.
I know I can match it with (.)\1+ or even capture it with ((.)\1+). But how can I obtain the repeat count and insert that value back into the string using regex, sed, or awk?
Upvotes: 0
Views: 98
Reputation: 37414
In GNU awk:
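$ echo aaadbCCCCCxx | awk -F '' '{
    b=""                                  # reset the buffer for each input line
    for(i=1;i<=NF;i+=RLENGTH) {
        c=$i                              # first character of the current run
        match(substr($0,i),c"+")          # RLENGTH = length of the run
        b=b (RLENGTH>1?RLENGTH "x":"") c
    }
    print b
}'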
3xadb5xC2xx
If regex metacharacters in the input should be treated as literal characters, as noted in the comments, one could try to detect and escape them (the solution below only points in that direction and is not complete):
$ echo \\\\\\..**aaadbCCCCC++xx |
awk -F '' '{
    b=""                                  # reset the buffer for each input line
    for(i=1;i<=NF;i+=RLENGTH) {
        c=$i
        # print i,c                       # for debugging
        if(c~/[*.\\]/)                    # if c is a regex metachar (not complete)
            c="\\"c                       # escape it
        match(substr($0,i),c"+")          # find the run of c
        b=b (RLENGTH>1?RLENGTH "x":"") $i # buffer to b
    }
    print b
}'
3x\2x.2x*3xadb5xC2x+2xx
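A rough sketch of how the escaping idea could be made more complete: instead of listing metacharacters, wrap each character in a bracket expression, which makes almost everything literal. The caret and the backslash are the two exceptions and get a backslash escape instead. The function name rle_literal is made up for this example, and it uses substr() rather than -F '' so it does not depend on GNU awk. It is only lightly tested, so treat it as a starting point.

```shell
# rle_literal is a hypothetical helper name, not part of the answer above.
rle_literal() {
  awk '{
    out = ""
    n = length($0)
    for (i = 1; i <= n; i += RLENGTH) {
      c = substr($0, i, 1)
      # Assumption: a bracket expression such as [*] matches the character
      # literally; the caret and the backslash are escaped with a backslash.
      re = (c == "^" || c == "\\") ? ("\\" c) : ("[" c "]")
      # Anchor at the start so only the run beginning at position i counts.
      match(substr($0, i), "^" re "+")
      out = out (RLENGTH > 1 ? RLENGTH "x" : "") c
    }
    print out
  }'
}

printf '%s\n' '\\\..**aaadbCCCCC++xx' | rle_literal
```

The match always succeeds because the substring starts with c, so RLENGTH is at least 1 and the loop always advances.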
Upvotes: 2
Reputation: 203792
I was hoping we'd have an MCVE by now but we don't, so what the heck - here is my best guess at what you're trying to do:
$ cat tst.awk
{
    out = ""
    for (pos=1; pos<=length($0); pos+=reps) {
        char = substr($0,pos,1)
        for (reps=1; char == substr($0,pos+reps,1); reps++);
        out = out (reps > 1 ? reps "x" : "") char
    }
    print out
}
$ awk -f tst.awk file
3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa
The above was run against the sample input that @Thor kindly provided:
$ cat file
aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa
The above will work for any input characters using any awk in any shell on any UNIX box. If you need to make it case-insensitive just throw a tolower() around each side of the comparison in the innermost for loop. If you need it to work on multi-character strings then you'll have to tell us how to identify where the substrings you're interested in start/end.
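For illustration, the case-insensitive variant could look roughly like this. It is the same loop with tolower() on both sides of the comparison, wrapped in a shell function whose name (rle_ci) is made up for this example. Note that the character printed for each run is the first one as it appears in the input, which is an assumption about what you want.

```shell
# rle_ci is a hypothetical helper name used only for this sketch.
rle_ci() {
  awk '{
    out = ""
    for (pos = 1; pos <= length($0); pos += reps) {
      char = substr($0, pos, 1)
      # Count following characters that equal char when case is ignored.
      for (reps = 1; tolower(char) == tolower(substr($0, pos + reps, 1)); reps++)
        ;
      # The run is reported with the case of its first character.
      out = out (reps > 1 ? reps "x" : "") char
    }
    print out
  }'
}

printf '%s\n' 'aAAd' 'bCcCcC' | rle_ci
```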
Upvotes: 1
Reputation: 47129
Just for fun.
With sed it is cumbersome but doable. Note that this example relies on GNU sed:
parse.sed
/(.)\1+/ {
: nextrepetition
/((.)\2+)/ s//\n\1\n/ # delimit the repetition with new-lines
h # and store the delimited version
s/^[^\n]*\n|\n[^\n]*$//g # now remove prefix and suffix
b charcount # count repetitions
: aftercharcount # return here after counting
G # append the new-line delimited version
# Reorganize pattern space to the desired format
s/^([^\n]+)\n([^\n]*)\n(.)[^\n]+\n/\2\1x\3/
# Run again if more repetitions exist
/(.)\1+/b nextrepetition
}
b
# Adapted from the wc -c example in the sed manual
# Ref: https://www.gnu.org/software/sed/manual/sed.html#wc-_002dc
: charcount
s/./a/g
# Do the carry. The t's and b's are not necessary,
# but they do speed up the thing
t a
: a; s/aaaaaaaaaa/b/g; t b; b done
: b; s/bbbbbbbbbb/c/g; t c; b done
: c; s/cccccccccc/d/g; t d; b done
: d; s/dddddddddd/e/g; t e; b done
: e; s/eeeeeeeeee/f/g; t f; b done
: f; s/ffffffffff/g/g; t g; b done
: g; s/gggggggggg/h/g; t h; b done
: h; s/hhhhhhhhhh//g
: done
# On the last line, convert back to decimal
: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ b loop
b aftercharcount
Run it like this:
sed -Ef parse.sed infile
With an infile like this:
aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa
The output is:
3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa
Upvotes: 1
Reputation: 241938
Perl to the rescue!
perl -pe 's/((.)\2+)/length($1) . "x$2"/ge'
-p reads the input line by line and prints it after processing
s/// is the substitution, similar to sed
/e makes the replacement evaluated as code
e.g.
aaadbCCCCCxx -> 3xadb5xC2xx
Upvotes: 4