Fabien Snauwaert

Reputation: 5651

RegEx to produce duplicate lines

Consider the following lines:

mastectomy  N
master  NtVA

Each word on the left is followed by one or more flags on the right, which indicate the possible parts of speech (POS) for that word, i.e. whether it can be a noun, a verb, etc. The two columns are tab-separated.

I'm trying to achieve the following list through a RegEx Search & Replace in my text editor:

mastectomy  N
master  N
master  t
master  V
master  A

The goal is to make my life easier when working with the list in Excel (for VLOOKUPs). The actual data is 230K lines long and case-sensitive (extracted from Moby List).

So far what I've got is this:

[Find] ([a-z]+)\t([a-z]?)([a-z]?)([a-z]?)([a-z]?)

[Replace] \1\t\2\n\1\t\3\n\1\t\4\n\1\t\5

But this is neither elegant nor flexible: it handles at most four flags, and it produces useless lines for words that have only one flag.
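
To make the problem concrete, here is a small Python sketch that applies the Find/Replace pair above to the two sample lines. It assumes the editor matches case-insensitively (otherwise `[a-z]` would not match flags such as `N`), and it uses Python's `re` module as a stand-in for the editor's regex engine, which may differ in details.

```python
import re

# Find/Replace pair from the question. IGNORECASE is an assumption:
# the flags mix upper and lower case (N, t, V, A).
pattern = re.compile(r"([a-z]+)\t([a-z]?)([a-z]?)([a-z]?)([a-z]?)", re.IGNORECASE)
replacement = r"\1\t\2\n\1\t\3\n\1\t\4\n\1\t\5"

sample = "mastectomy\tN\nmaster\tNtVA"
result = pattern.sub(replacement, sample)

# Every word always yields four lines, even when it has a single flag.
lines = result.split("\n")
for line in lines:
    print(repr(line))
```

The word `mastectomy` comes out as one useful line followed by three lines that contain only the word and a trailing tab, which is exactly the clutter described above.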

How can I improve it?

Thank you,

Fabien

Upvotes: 1

Views: 62

Answers (3)

user3065349

Reputation: 171

I have a simple solution using awk:

#!/bin/gawk -f

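# Process only lines with exactly two fields (word and flags).
NF==2 {
    STR=$2
    # Peel off one flag at a time and print it next to the word.
    # Note: the output separator is a single space, not a tab.
    while(length(STR)>0){
        firstletter=substr(STR, 1, 1);
        print $1" "firstletter;
        # Drop the flag just printed; substr() without a length returns the rest.
        STR=substr(STR, 2);
    }
}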

which gives:

[col_expand $] cat input.dat
mastectomy N
master NtVA

[col_expand $] 
[col_expand $] ./col_expand.awk input.dat
mastectomy N
master N
master t
master V
master A
[col_expand $]

Upvotes: 1

Birei

Reputation: 36272

Another approach is to do the job from the command line with a scripting language like Perl:

perl -ane '
    @f = split //, $F[1]; 
    printf qq|%s\t%s\n|, $F[0], shift @f while @f;
' infile

It yields:

mastectomy  N
master  N
master  t
master  V
master  A
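
For readers more at home in Python, the same split-and-emit idea can be sketched as below. This is only an illustration: the function name `expand_flags` is made up for this example, and it splits on the tab only (rather than on any whitespace), on the assumption that some entries might contain spaces.

```python
def expand_flags(lines):
    """Yield one 'word<TAB>flag' string per flag character."""
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # assumption: silently skip lines without a tab
        word, flags = line.split("\t", 1)
        for flag in flags:
            yield f"{word}\t{flag}"

sample = ["mastectomy\tN\n", "master\tNtVA\n"]
expanded = list(expand_flags(sample))
print("\n".join(expanded))
```

Because it is a generator, it can be fed an open file object directly, so the 230K-line list does not have to be held in memory all at once.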

Upvotes: 1

JonM

Reputation: 1374

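You could try running a replace like this repeatedly, with case-insensitive matching turned on (the flags mix upper and lower case).

Use expression:

^(.+?)(\t[a-z])([a-z]+)

replace with:

\1\2\n\1\t\3

Each pass splits one flag off every line that still has several, so a word with four flags needs three passes. Keep running it until nothing more is replaced.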
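
The "repeat until nothing changes" loop can be sketched in Python as follows. It assumes case-insensitive, multi-line matching (so that `^` anchors at every line start), and the helper name `expand_until_stable` is invented for this example; a text editor's engine may behave slightly differently.

```python
import re

# Expression and replacement from this answer. IGNORECASE and MULTILINE
# are assumptions about how the editor is configured.
pattern = re.compile(r"^(.+?)(\t[a-z])([a-z]+)", re.IGNORECASE | re.MULTILINE)
replacement = r"\1\2\n\1\t\3"

def expand_until_stable(text):
    """Apply the replacement repeatedly; return (text, passes that changed something)."""
    passes = 0
    while True:
        text, count = pattern.subn(replacement, text)
        if count == 0:
            return text, passes
        passes += 1

sample = "mastectomy\tN\nmaster\tNtVA"
expanded_text, passes = expand_until_stable(sample)
print(expanded_text)
print("passes:", passes)
```

On the two sample lines this takes three passes, one fewer than the longest run of flags.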

Upvotes: 1
