usr203050

Reputation: 33

Merge two files with no pseudo-repetitions

I have two text files, file1.txt and file2.txt, which both contain lines of words like this:

fare word word-ed wo-ded wor

and

fa-re text uncial woded wor worded

or something like this. By a word, I mean a sequence of the letters a-z, possibly with accents, together with the symbol -. My question is: how can I create a third file, output.txt, from the Linux command line (using awk, sed, etc.) out of these two files, such that it satisfies the following conditions:

  1. If the same word occurs in the two files, the third file output.txt contains it exactly once.
  2. If a hyphenated version of a word in one file occurs in the other (for example, fa-re in file2.txt), then only the hyphenated version is retained in output.txt (so only fa-re is retained in our example).

Thus, output.txt should contain the following words: fa-re word word-ed wo-ded wor text uncial

================Edit========================

I have modified the files and given the output file as well. I will try to make sure manually that there are no differently hyphenated words (such as wod-ed and wo-ded).

Upvotes: 3

Views: 79

Answers (3)

John B

Reputation: 3646

Awk Solution

!($1 in words) {
    split($1, f, "-")
    w = f[1] f[2]
    if (f[2])
        words[w] = $1
    else
        words[w]
}
END {
    for (k in words)
        if (words[k])
            print words[k]
        else
            print k
}

$ awk -f script.awk file1.txt file2.txt
wor
fa-re
text
wo-ded
uncial
word-ed
word

Breakdown

!($1 in words) {
    ...
}

Only process the line if the first field doesn't already reside as a key in the array words.


split($1, f, "-")

Splits the first field into the array f using - as the delimiter. The first and second parts of the word will reside in f[1] and f[2] respectively. If the word is not hyphenated, it will reside in its entirety in f[1].


w = f[1] f[2]

Assigns the dehyphenated word to w by concatenating the first and second parts of the word. If the word was not hyphenated to begin with, the result is the word itself, since f[2] is empty.
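Note that f[1] f[2] joins only the first two pieces, so a word with more than one hyphen would lose everything after the second hyphen. Here is a minimal sketch of an alternative, assuming such words can occur (wo-rd-ed below is a made-up example, not from the question): strip every hyphen with gsub instead of using split.

```shell
# Hypothetical input: the same word with zero, one, and two hyphens.
# gsub removes every "-" from a copy of the field, so all three
# variants collapse to the same dehyphenated key.
dehyphenated=$(printf '%s\n' worded word-ed wo-rd-ed |
    awk '{ w = $1; gsub(/-/, "", w); print w }')

printf '%s\n' "$dehyphenated"
```

If this were adopted in script.awk, the test for "was the word hyphenated" could become w != $1 rather than f[2]. That substitution is a suggestion only and is not part of the script shown above.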


if (f[2])
    words[w] = $1
else
    words[w]

Store the dehyphenated word as a key in the words array. If the word was hyphenated (f[2] is not empty), store the original hyphenated form as the key's value.


END {
    for (k in words)
        if (words[k])
            print words[k]
        else
            print k
}

After all input files have been processed, iterate through the words array; if the key holds a value (the hyphenated word), print it, otherwise print the key (the non-hyphenated word).
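One caveat: awk does not specify the order in which for (k in words) visits keys, so the output order can differ between awk implementations. If a stable order matters, piping through sort is one option. The sketch below assumes one word per line in each file and recreates the question's sample data in a scratch directory; the script body is inlined only to keep the example self-contained.

```shell
# Sample data from the question, one word per line (an assumption).
tmpdir=$(mktemp -d)
printf '%s\n' fare word word-ed wo-ded wor > "$tmpdir/file1.txt"
printf '%s\n' fa-re text uncial woded wor worded > "$tmpdir/file2.txt"

# Same logic as script.awk; LC_ALL=C sort gives a byte-order,
# locale-independent ordering of the merged words.
merged=$(awk '
!($1 in words) {
    split($1, f, "-")
    w = f[1] f[2]
    if (f[2])
        words[w] = $1
    else
        words[w]
}
END {
    for (k in words)
        if (words[k])
            print words[k]
        else
            print k
}' "$tmpdir/file1.txt" "$tmpdir/file2.txt" | LC_ALL=C sort)

printf '%s\n' "$merged"
rm -r "$tmpdir"
```

Redirecting the sorted stream to output.txt instead of capturing it in a variable would produce the file the question asks for.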

Upvotes: 1

karakfa

Reputation: 67467

This is not exactly what you asked for, but it is perhaps better suited to what you need.

awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for( i in w) print w[i]}' file1.txt file2.txt

This will group all words in the files by equivalence class (words that match once hyphens are removed). You can then make another pass over this result to get what you want.

uncial
word
woded wo-ded 
wor wor
worded word-ed
text
fa-re fare

The advantages are that you don't have to check manually for alternative hyphenations, and that you can see how many different instances you have of each word. For example, the following filters the list above down to the desired output.

awk '{w=$1; for(i=1;i<=NF;i++) if(match($i,/-/)!=0)w=$i; print w}'
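The two passes can also be chained in a single pipeline. This is a sketch only, assuming one word per line and using the question's sample data written to a scratch directory; the final sort is an addition to make the order reproducible, since for (i in w) has no defined order.

```shell
# Sample data from the question, one word per line (an assumption).
tmpdir=$(mktemp -d)
printf '%s\n' fare word word-ed wo-ded wor > "$tmpdir/file1.txt"
printf '%s\n' fa-re text uncial woded wor worded > "$tmpdir/file2.txt"

# Pass 1 groups the variants of each word on one line; pass 2 keeps a
# hyphenated variant if the line has one, else the first field.
result=$(awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for (i in w) print w[i]}' \
        "$tmpdir/file1.txt" "$tmpdir/file2.txt" |
    awk '{w=$1; for(i=1;i<=NF;i++) if(match($i,/-/)!=0) w=$i; print w}' |
    LC_ALL=C sort)

printf '%s\n' "$result"
rm -r "$tmpdir"
```

If a group contains several differently hyphenated variants, the second pass keeps the last one on the line, so inspecting the grouped output first is still worthwhile.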

Upvotes: 1

jas

Reputation: 10865

Another awk:

!($1 in a) || $1 ~ "-" { 
    key = value = $1; gsub("-","",key); a[key] = value 
}
END { for (i in a) print a[i] }

$ awk -f npr.awk file1.txt file2.txt
text
word-ed
uncial
wor
wo-ded
word
fa-re

Upvotes: 2
