Wang

Reputation: 8192

How can I output the number of repeats of a pattern in regex?

I would like to output the number of repeats of a pattern with regex. For example, convert "aaad" to "3xad" and "bCCCCC" to "b5xC". I want to do this in sed or awk.

I know I can match a run with (.)\1+, or capture it with ((.)\1+). But how can I obtain the repeat count and insert that value back into the string using regex, sed, or awk?

Upvotes: 0

Views: 98

Answers (4)

James Brown

Reputation: 37414

In GNU awk:

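$ echo aaadbCCCCCxx | awk -F '' '{
    b=""                                  # reset the buffer for each input line
    for(i=1;i<=NF;i+=RLENGTH) {
        c=$i                              # current character
        match(substr($0,i),c"+")          # RLENGTH = length of the run of c
        b=b (RLENGTH>1?RLENGTH "x":"") c  # prefix runs longer than 1 with the count
    }
    print b
}'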
3xadb5xC2xx

If regex metacharacters in the input should be treated as literal characters, as noted in the comments, one could try to detect and escape them (the solution below is only a sketch, and its list of metacharacters is not complete):

$ echo \\\\\\..**aaadbCCCCC++xx |
awk -F '' '{
    b=""                               # reset the buffer for each input line
    for(i=1;i<=NF;i+=RLENGTH) {
        c=$i
        # print i,c                        # for debugging
        if(c~/[*.\\]/)                     # if c is a regex metachar (not complete)
            c="\\"c                        # escape it
        match(substr($0,i),c"+")           # find the run of c
        b=b (RLENGTH>1?RLENGTH "x":"") $i  # buffer to b
    }
    print b
}'
3x\2x.2x*3xadb5xC2x+2xx

Upvotes: 2

Ed Morton

Reputation: 203792

I was hoping we'd have an MCVE by now, but we don't, so what the heck: here is my best guess at what you're trying to do:

$ cat tst.awk
{
    out = ""
    for (pos=1; pos<=length($0); pos+=reps) {
        char = substr($0,pos,1)
        for (reps=1; char == substr($0,pos+reps,1); reps++);
        out = out (reps > 1 ? reps "x" : "") char
    }
    print out
}

$ awk -f tst.awk file
3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa

The above was run against the sample input that @Thor kindly provided:

$ cat file
aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa

The above will work for any input characters using any awk in any shell on any UNIX box. If you need to make it case-insensitive just throw a tolower() around each side of the comparison in the innermost for loop. If you need it to work on multi-character strings then you'll have to tell us how to identify where the substrings you're interested in start/end.
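The case-insensitive variant mentioned above could look roughly like the sketch below. The wrapper name count_repeats_ci is hypothetical, and printing each run in the case of its first character is an assumption, since the question does not say which case to keep.

```shell
# Hypothetical wrapper around a case-insensitive version of tst.awk.
count_repeats_ci() {
    awk '{
        out = ""
        for (pos = 1; pos <= length($0); pos += reps) {
            char = substr($0, pos, 1)
            # Compare lowercased characters so "a" and "A" extend the same run.
            for (reps = 1; tolower(char) == tolower(substr($0, pos + reps, 1)); reps++);
            # Assumption: the run is printed using the case of its first character.
            out = out (reps > 1 ? reps "x" : "") char
        }
        print out
    }'
}

printf '%s\n' 'aAad' 'bCcCCC' | count_repeats_ci
```

With that input it prints 3xad and b5xC, one per line.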

Upvotes: 1

Thor

Reputation: 47129

Just for fun.

With sed it is cumbersome but doable. Note that this example relies on GNU sed:

parse.sed

/(.)\1+/ {
  : nextrepetition
  /((.)\2+)/ s//\n\1\n/             # delimit the repetition with new-lines
  h                                 # and store the delimited version
  s/^[^\n]*\n|\n[^\n]*$//g          # now remove prefix and suffix
  b charcount                       # count repetitions
  : aftercharcount                  # return here after counting
  G                                 # append the new-line delimited version

  # Reorganize pattern space to the desired format
  s/^([^\n]+)\n([^\n]*)\n(.)[^\n]+\n/\2\1x\3/

  # Run again if more repetitions exist
  /(.)\1+/b nextrepetition
}

b

# Adapted from the wc -c example in the sed manual
# Ref: https://www.gnu.org/software/sed/manual/sed.html#wc-_002dc
: charcount

s/./a/g

# Do the carry.  The t's and b's are not necessary,
# but they do speed up the thing
t a
: a;  s/aaaaaaaaaa/b/g; t b; b done
: b;  s/bbbbbbbbbb/c/g; t c; b done
: c;  s/cccccccccc/d/g; t d; b done
: d;  s/dddddddddd/e/g; t e; b done
: e;  s/eeeeeeeeee/f/g; t f; b done
: f;  s/ffffffffff/g/g; t g; b done
: g;  s/gggggggggg/h/g; t h; b done
: h;  s/hhhhhhhhhh//g

: done

# Convert the count back to decimal

: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/

y/bcdefgh/abcdefg/
/[a-h]/ b loop

b aftercharcount

Run it like this:

sed -Ef parse.sed infile

With an infile like this:

aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa

The output is:

3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa

Upvotes: 1

choroba

Reputation: 241938

Perl to the rescue!

perl -pe 's/((.)\2+)/length($1) . "x$2"/ge'
  • -p reads the input line by line and prints each line after processing
  • s/// is substitution, similar to sed's
  • /e evaluates the replacement as code, and /g applies it to every match

e.g.

aaadbCCCCCxx -> 3xadb5xC2xx

Upvotes: 4
