Reputation: 33
I have two text files, file1.txt and file2.txt, which both contain lines of words like this:
fare
word
word-ed
wo-ded
wor
and
fa-re
text
uncial
woded
wor
worded
or something like this. By a word, I mean a succession of the letters a-z, possibly with accents, together with the symbol -. My question is: how can I create a third file, output.txt, from the Linux command line (using awk, sed, etc.) out of these two files that satisfies the following three conditions:
If a word occurs in both files, output.txt contains it exactly once.
If a hyphenated version (for example, fa-re in file2.txt) of a word in one file occurs in the other, then only the hyphenated version is retained in output.txt (here, only fa-re is retained).
Thus, output.txt should contain the following words:
fa-re
word
word-ed
wo-ded
wor
text
uncial
================Edit========================
I have modified the files and given the output file as well. I will try to make sure manually that there are no differently hyphenated words (such as wod-ed and wo-ded).
Upvotes: 3
Views: 79
Reputation: 3646
!($1 in words) {
split($1, f, "-")
w = f[1] f[2]
if (f[2])
words[w] = $1
else
words[w]
}
END {
for (k in words)
if (words[k])
print words[k]
else
print k
}
$ awk -f script.awk file1.txt file2.txt
wor
fa-re
text
wo-ded
uncial
word-ed
word
!($1 in words) {
...
}
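Only process the line if the first field is not already a key in the array words. The keys are dehyphenated words, so this skips a plain word whose plain or hyphenated form has already been seen.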
split($1, f, "-")
Splits the first field into the array f, using - as the delimiter. The first and second parts of the word end up in f[1] and f[2] respectively. If the word is not hyphenated, it ends up in its entirety in f[1].
w = f[1] f[2]
Assigns the dehyphenated word to w by concatenating the first and second parts of the word. If the word was not originally hyphenated, the result is the same, since f[2] is empty.
if (f[2])
words[w] = $1
else
words[w]
Store the dehyphenated word as a key in the words array. If the word was hyphenated (f[2] is not empty), store the original hyphenated word as the key's value.
END {
for (k in words)
if (words[k])
print words[k]
else
print k
}
After both files have been processed, iterate through the words array; if the key holds a value (the hyphenated word), print it, otherwise print the key (the non-hyphenated word).
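One limitation of the script above is that split() keeps only f[1] and f[2], so a word with more than one hyphen would lose its tail. A minimal sketch of the same approach that strips every hyphen with gsub() instead is shown below. The function name merge_words and the input wo-rd-ed are hypothetical, chosen only for illustration; the sort at the end is there just to make the order predictable.

```shell
# Sketch: same idea as script.awk, but gsub() removes ALL hyphens,
# so multi-hyphen words (an assumption, not in the question's sample) work too.
merge_words() {
    awk '
    !($1 in words) {
        w = $1
        # gsub() returns the number of hyphens it removed
        if (gsub(/-/, "", w))
            words[w] = $1       # hyphenated: remember the original spelling
        else
            words[w]            # plain word: just create the key
    }
    END {
        for (k in words)
            if (words[k])
                print words[k]  # a hyphenated form was seen
            else
                print k         # only the plain form was seen
    }' "$@" | LC_ALL=C sort
}
```

Called as merge_words file1.txt file2.txt, it should give the same seven words as the original script, in sorted order.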
Upvotes: 1
Reputation: 67467
This is not exactly what you asked, but it is perhaps better suited to what you need.
awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for( i in w) print w[i]}' file1.txt file2.txt
This groups all words in the files by equivalence class (words that match once the hyphens are removed). You can run another pass over this result to get the output you want.
uncial
word
woded wo-ded
wor wor
worded word-ed
text
fa-re fare
The advantages are that you do not have to check manually whether there are alternative hyphenations, and that you can see how many different instances you have for each word. For example, the following filters the previous list down to the desired output:
awk '{w=$1; for(i=1;i<=NF;i++) if(match($i,/-/)!=0)w=$i; print w}'
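The two passes can also be chained in one pipeline. The sketch below wraps them in a function, and adds a second function that lists equivalence classes containing more than one distinct hyphenated spelling (such as wod-ed and wo-ded), which is the case the question proposes to check by hand. The function names group_then_pick and report_conflicts are hypothetical; the awk programs are the ones from this answer, with only the conflict check added.

```shell
# Pass 1: group words by their dehyphenated form (one class per line).
group_words() {
    awk '{ k = $1; gsub("-", "", k); w[k] = $1 FS w[k] }
         END { for (i in w) print w[i] }' "$@"
}

# Pass 1 + pass 2: keep a hyphenated member of each class if there is one.
group_then_pick() {
    group_words "$@" |
    awk '{ w = $1
           for (i = 1; i <= NF; i++) if ($i ~ /-/) w = $i
           print w }' |
    LC_ALL=C sort    # sorting is optional; it only fixes the output order
}

# Print every class that has two or more different hyphenated spellings.
report_conflicts() {
    group_words "$@" |
    awk '{ n = 0
           split("", seen)             # portable way to empty the array
           for (i = 1; i <= NF; i++)
               if ($i ~ /-/ && !($i in seen)) { seen[$i]; n++ }
           if (n > 1) print }'
}
```

Running report_conflicts file1.txt file2.txt first and group_then_pick afterwards is one way to confirm that no class is ambiguous before trusting the merged list.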
Upvotes: 1
Reputation: 10865
Another awk:
!($1 in a) || $1 ~ "-" {
key = value = $1; gsub("-","",key); a[key] = value
}
END { for (i in a) print a[i] }
$ awk -f npr.awk file1.txt file2.txt
text
word-ed
uncial
wor
wo-ded
word
fa-re
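The listings in this thread show the same seven words in different orders because awk does not specify the order in which for (i in a) visits the keys. If a stable order matters, one option is to sort the result, as in this sketch. The function name merge_sorted is hypothetical; the awk program is the one above, unchanged.

```shell
# Sketch: the script above, with its output sorted for a predictable order.
merge_sorted() {
    awk '
    # Store a word if its dehyphenated key is new, or if it is hyphenated
    # (so a hyphenated spelling always replaces a plain one).
    !($1 in a) || $1 ~ "-" {
        key = value = $1; gsub("-", "", key); a[key] = value
    }
    END { for (i in a) print a[i] }' "$@" | LC_ALL=C sort
}
```

For example, merge_sorted file1.txt file2.txt > output.txt writes the merged list in byte order.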
Upvotes: 2