Reputation: 11799

Eliminate duplicate words across lines

I'd like a sed script that eliminates repeated words in a text file on one or more lines. For example:

this is is is a text file file it is littered with duplicate words
words words on one or more lines lines
lines
  lines

should transform to:

this is a text file it is littered with duplicate words
on one or more lines

This awk script produces the correct output:

{
    for (i = 1; i <= NF; i++) {
        word = $i

        if (word != last) {
            if (i < NF) {
                next_word = $(i+1)

                if (word != next_word) {
                    printf("%s ", word)
                }
            } else {
                printf("%s\n", word)
            }
        }
    }

    last = word
}

but I'd really like a sed "one-liner".

Upvotes: 1

Answers (3)

glenn jackman

Reputation: 246847

sed -En '
    H
    ${
        g
        s/^\n//
        s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
        p
    }
' file

This is a text file with duplicate words
on one or more lines

where

H -- append each line to the hold space
${...} -- on the last line, perform the enclosed commands
g -- replace pattern space with the contents of the hold space
s/^\n// -- remove leading newline (side-effect of H on first line)
s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
..1..2............2............1..........................
- the key here is to capture the word and its whitespace separately, so that the back reference can match the word even when the whitespace differs.
- capture #1 is the first word and its whitespace (which can contain newlines), and capture #2 is just the word.
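
As a quick check, here is a sketch that feeds the sample input from the question through the script above. The names sample_input and dedupe_hold are hypothetical wrappers added for illustration, and GNU sed is assumed (for \< and for back references under -E).

```shell
# Hypothetical helper: prints the sample input from the question.
sample_input() {
    printf '%s\n' \
        'this is is is a text file file it is littered with duplicate words' \
        'words words on one or more lines lines' \
        'lines' \
        '  lines'
}

# Hypothetical wrapper: same sed script as above, reading stdin instead of a file.
dedupe_hold() {
    sed -En '
        H
        ${
            g
            s/^\n//
            s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
            p
        }
    '
}

sample_input | dedupe_hold
```

With GNU sed this should print the two lines the question asks for. The last line keeps one trailing space, because capture #1 includes the whitespace that followed the first "lines".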

Upvotes: 0

Alain Merigot

Reputation: 11547

With sed, you can use

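sed -E 's/\b([a-z]+) +\1\b/\1/g'

Note that this removes only one duplicate per word: it does not handle triplicates or duplicates split across line breaks. The \b word boundaries (a GNU extension) keep it from matching inside a word, such as the "is" at the end of "this".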

This can be fixed by joining all the lines and looping:

sed -E ':a;N;s/(\b[a-z]+\b)([ \n])[ \n]*\b\1\b */\1\2/g;ba'
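
To see why a loop is needed for triplicates, here is a minimal single-line sketch. The function names dedupe_once and dedupe_loop are made up for this illustration, it uses a t loop rather than the N loop above, and it assumes GNU sed for \b.

```shell
# One pass: with /g, scanning resumes after each replacement,
# so a triple is only reduced to a double.
dedupe_once() {
    sed -E 's/\b([a-z]+) +\1\b/\1/g'
}

# Looping: "t again" jumps back to the label for as long as the
# preceding s command replaced something.
dedupe_loop() {
    sed -E -e ':again' -e 's/\b([a-z]+) +\1\b/\1/g' -e 't again'
}

printf 'is is is a\n' | dedupe_once   # prints: is is a
printf 'is is is a\n' | dedupe_loop   # prints: is a
```

Each successful substitution shortens the line, so the loop always terminates.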

Upvotes: 0

Benjamin W.

Reputation: 52152

This works with GNU sed, at least for the example input:

$ sed -Ez ':a;s/(\<\S+)(\s+)\1\s+/\1\2/g;ta' infile
This is a text file and is littered with duplicate words
on one or more lines

The -E option is just there to avoid having to escape the capture group parentheses and + quantifiers.

-z treats the input as null-byte separated; since the text contains no null bytes, the whole file is read as a single line.

The command is then structured as

:a      # label
s///g   # substitution
ta      # jump to label if substitution did something

And the substitution is this:

s/(\<\S+)(\s+)\1\s+/\1\2/g

First capture group: (\<\S+) – a complete word (start-of-word boundary, then one or more non-whitespace characters)
Second capture group: (\s+) – one or more whitespace characters after that first word
\1\s+ – the first word again plus whatever whitespace follows it

This preserves the whitespace after the first word and discards the whitespace after the duplicate.
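
A small sketch of that whitespace behaviour, assuming GNU sed; dedupe_z is a hypothetical wrapper name around the command above:

```shell
# Hypothetical wrapper around the command from this answer (GNU sed only).
dedupe_z() {
    sed -Ez ':a;s/(\<\S+)(\s+)\1\s+/\1\2/g;ta'
}

# Three spaces follow the first "one"; a single space follows the duplicate.
# The three spaces survive and the single space is dropped.
printf 'one   one two\n' | dedupe_z   # prints: one   two
```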

Note that -z, \<, \S and \s are all GNU extensions to POSIX sed. -E is also supported by BSD sed and was added to POSIX in the 2024 edition.

Upvotes: 1
