Reputation: 5651
Consider the following lines:
mastectomy N
master NtVA
Each word on the left is separated by a tab from one or more flags on the right. The flags indicate the possible parts of speech (POS) for the word, i.e. whether it can be a noun, a verb, etc.
I'm trying to achieve the following list through a RegEx Search & Replace in my text editor:
mastectomy N
master N
master t
master V
master A
The goal is to make my life easier when working with the list in Excel (for VLOOKUPs). The actual data is 230K lines long and case-sensitive (extracted from Moby List).
So far what I've got is this:
[Find] ([a-z]+)\t([a-z]?)([a-z]?)([a-z]?)([a-z]?)
[Replace] \1\t\2\n\1\t\3\n\1\t\4\n\1\t\5
But this is neither elegant nor flexible, and it produces useless lines for words that have only one flag.
How can I improve it?
Thank you-
Fabien
Upvotes: 1
Views: 62
Reputation: 171
I have a simple solution using awk:
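#!/bin/gawk -f
# Input and output are tab-separated: word<TAB>flags
BEGIN { FS = OFS = "\t" }
NF == 2 {
    flags = $2
    # Peel off one flag at a time and print it next to the word.
    while (length(flags) > 0) {
        print $1, substr(flags, 1, 1)
        flags = substr(flags, 2)
    }
}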
which gives:
[col_expand $] cat input.dat
mastectomy N
master NtVA
[col_expand $]
[col_expand $] ./col_expand.awk input.dat
mastectomy N
master N
master t
master V
master A
[col_expand $]
Upvotes: 1
Reputation: 36272
Another approach is to do the job from the command line with a scripting language like perl:
perl -ane '
@f = split //, $F[1];
printf qq|%s\t%s\n|, $F[0], shift @f while @f;
' infile
It yields:
mastectomy N
master N
master t
master V
master A
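For comparison, the same per-character split can be sketched in Python. This is only an illustration, not part of the perl answer: the function names expand_line and expand_lines are made up here, and it assumes each line is exactly word, tab, flags. Lines without a tab produce no output.

```python
def expand_line(line):
    """Yield one 'word<TAB>flag' string per flag character on the line."""
    # partition() splits at the first tab; a line with no tab gives empty flags.
    word, _, flags = line.rstrip("\n").partition("\t")
    for flag in flags:
        yield f"{word}\t{flag}"


def expand_lines(lines):
    """Expand every line of an iterable (for example an open file)."""
    for line in lines:
        yield from expand_line(line)


# Hypothetical sample mirroring the two lines from the question.
sample = ["mastectomy\tN\n", "master\tNtVA\n"]
expanded = list(expand_lines(sample))
print("\n".join(expanded))
```

Because expand_lines is a generator, it could be fed an open file object directly, so the 230K-line list would not need to be held in memory all at once.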
Upvotes: 1
Reputation: 1374
You could try running a replace like this until there are no replacements.
Use expression:
^(.+?)(\t[a-z])([a-z]+)
replace with:
\1\2\n\1\t\3
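and run it until nothing can be replaced. Each pass splits one flag off every word that still has several, so the number of passes needed is one less than the longest run of flags. Note that [a-z] matches uppercase flags such as N only if the search is case-insensitive; otherwise use [A-Za-z].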
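To see how the repeated replacement converges, here is a minimal Python sketch of the same loop. It is an illustration under some assumptions: the helper name expand_flags is hypothetical, the character classes are written as [A-Za-z] because Python's re module is case-sensitive by default, and re.MULTILINE is set so that ^ matches at the start of every line.

```python
import re

# The find expression from the answer, with [A-Za-z] so uppercase flags match.
PATTERN = re.compile(r"^(.+?)(\t[A-Za-z])([A-Za-z]+)", re.MULTILINE)


def expand_flags(text):
    """Repeat the search and replace until a pass changes nothing.

    Returns the final text and the number of passes that made replacements.
    """
    passes = 0
    while True:
        # subn() returns the new text and how many replacements were made.
        text, count = PATTERN.subn(r"\1\2\n\1\t\3", text)
        if count == 0:
            return text, passes
        passes += 1


# Hypothetical sample mirroring the two lines from the question.
sample = "mastectomy\tN\nmaster\tNtVA\n"
result, passes = expand_flags(sample)
print(result, end="")
print("passes:", passes)
```

With the sample above, master has four flags, so three passes make replacements and the fourth finds nothing left to split.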
Upvotes: 1