Jan Nielsen

Reputation: 11799

Eliminate duplicate words across lines

I'd like a sed script that eliminates consecutive repeated words in a text file, whether the repeats sit on one line or span several lines. For example:

this is is is a text file file it is littered with duplicate words
words words on one or more lines lines
lines
  lines

should transform to:

this is a text file it is littered with duplicate words
on one or more lines

This awk script produces the correct output:

{
    for (i = 1; i <= NF; i++) {
        word = $i

        if (word != last) {
            if (i < NF) {
                next_word = $(i+1)

                if (word != next_word) {
                    printf("%s ", word)
                }
            } else {
                printf("%s\n", word)
            }
        }
    }

    last = word
}

but I'd really like a sed "one-liner".

Upvotes: 1

Views: 46

Answers (3)

glenn jackman

Reputation: 246847

sed -En '
    H
    ${
        g
        s/^\n//
        s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
        p
    }
' file
This is a text file with duplicate words
on one or more lines

where

  • H -- append each line to the hold space
  • ${...} -- on the last line, perform the enclosed commands
  • g -- replace pattern space with the contents of the hold space
  • s/^\n// -- remove leading newline (side-effect of H on first line)
  • s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
    ..1..2............2............1..........................

    • the key here is to capture the text and the spaces separately so that the back reference can match with differing whitespace.
    • captured expression #1 is the first word and its whitespace (which can contain newlines), and capture #2 is just the word.
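
To try the script against the sample from the question, it can be wrapped in a small shell function. This is only a sketch: it assumes GNU sed (for `-E` and `\<`), and `dedupe_hold` is a made-up name, not part of any tool.

```shell
# Hypothetical wrapper around the hold-space script above (assumes GNU sed).
dedupe_hold() {
    sed -En '
        H
        ${
            g
            s/^\n//
            s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
            p
        }
    ' "$@"
}

# Feed the question's four sample lines through the wrapper.
printf '%s\n' \
    'this is is is a text file file it is littered with duplicate words' \
    'words words on one or more lines lines' \
    'lines' \
    '  lines' | dedupe_hold
```

With GNU sed this should print the two lines the question asks for. The last line may end in a space, because capture #1 keeps the whitespace that followed the first word.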

Upvotes: 0

Alain Merigot

Reputation: 11547

With sed, you can use

sed -E 's/([a-z]+) +\1/\1/g'

Note that this only collapses a word that is repeated once on the same line; it does not handle triplicates or repeats split across line breaks.

Both cases can be handled by joining all the lines and looping:

sed -E ':a;N;s/(\b[a-z]+\b)([ \n])[ \n]*\b\1\b */\1\2/g;ba'
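
The limits of the simpler single-line command can be checked directly. The sketch below wraps it in a hypothetical function, `dedupe_simple`, and feeds it three short inputs. The third case is an extra observation rather than something stated above: without word boundaries, the pattern can also match inside a longer word.

```shell
# Hypothetical wrapper around the simple single-line command.
dedupe_simple() {
    sed -E 's/([a-z]+) +\1/\1/g'
}

# A word repeated once on one line is collapsed.
printf 'a text file file\n' | dedupe_simple    # a text file

# A triple run keeps one repeat: the g flag does not rescan its own output.
printf 'it is is is\n' | dedupe_simple         # it is is

# No word boundary: the "is" inside "this" pairs with the next "is".
printf 'this is it\n' | dedupe_simple          # this it
```

The `\b` anchors in the looping version guard against the third case.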

Upvotes: 0

Benjamin W.

Reputation: 52152

This works with GNU sed, at least for the example input:

$ sed -Ez ':a;s/(\<\S+)(\s+)\1\s+/\1\2/g;ta' infile
This is a text file and is littered with duplicate words
on one or more lines

The -E option is just there to avoid having to escape the capture group parentheses and + quantifiers.

-z makes sed split its input on null bytes instead of newlines, so a text file without null bytes is read as a single line.
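
A quick way to see the effect of -z, assuming GNU sed, is to try replacing newlines with and without it:

```shell
# Without -z each line is its own record, so the pattern space never
# contains \n and nothing is replaced.
printf 'one\ntwo\n' | sed 's/\n/,/g'

# With -z the whole input (no null bytes here) is one record, newlines
# included, so both newlines become commas: one,two,
printf 'one\ntwo\n' | sed -z 's/\n/,/g'
```

Note that the second command's output has no trailing newline, since the final newline was part of the record and got replaced too.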

The command is then structured as

:a      # label
s///g   # substitution
ta      # jump to label if substitution did something
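
As a smaller illustration of why the loop is needed (a toy input, not the duplicate-word pattern itself): a single pass with the g flag does not rescan text it has already replaced, so runs of odd length leave leftovers.

```shell
# One pass: five dashes are consumed as two pairs plus one leftover.
printf 'a-----b\n' | sed 's/--/-/g'                        # a---b

# Looping with a label and t repeats the substitution until nothing changes.
printf 'a-----b\n' | sed -e ':a' -e 's/--/-/g' -e 'ta'     # a-b
```

The same thing happens with "is is is": one pass removes one repeat, and the jump back to the label removes the rest.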

And the substitution is this:

s/(\<\S+)(\s+)\1\s+/\1\2/g
  • First capture group: (\<\S+) – a complete word (start-of-word boundary, then one or more non-whitespace characters)
  • Second capture group: (\s+) – one or more whitespace characters after that first word, newlines included
  • \1\s+ – the first word again plus whatever whitespace follows it

This preserves the whitespace after the first word and discards the whitespace after the duplicate.

Note that -E, -z, \<, \S and \s are all GNU extensions to POSIX sed.

Upvotes: 1
