Reputation: 11799
I'd like a sed script that removes consecutive repeated words from a text file, including repeats that continue across line breaks. For example:
this is is is a text file file it is littered with duplicate words
words words on one or more lines lines
lines
lines
should transform to:
this is a text file it is littered with duplicate words
on one or more lines
This awk script produces the correct output:
{
    for (i = 1; i <= NF; i++) {
        word = $i
        if (word != last) {
            if (i < NF) {
                next_word = $(i+1)
                if (word != next_word) {
                    printf("%s ", word)
                }
            } else {
                printf("%s\n", word)
            }
        }
    }
    last = word
}
but I'd really like a sed "one-liner".
Upvotes: 1
Views: 46
Reputation: 246847
sed -En '
H
${
g
s/^\n//
s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
p
}
' file
This is a text file with duplicate words
on one or more lines
where
H -- append each line to the hold space
${...} -- on the last line, perform the enclosed commands
g -- replace pattern space with the contents of the hold space
s/^\n// -- remove leading newline (side-effect of H on first line)
s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
..1..2............2............1..........................
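To make the leading-newline side effect of H concrete, here is a small sketch (the two-line input `one`, `two` is made up for illustration). It collects the lines with H, copies the hold space back with g on the last line, and uses the `l` command to print the pattern space unambiguously, with newlines shown as \n and the end marked by $.

```shell
# Hypothetical two-line input; -n suppresses the normal output.
# H appends "\n" plus the current line to the hold space on every line,
# so the very first H leaves a newline at the front.
printf '%s\n' one two | sed -n '
H
${
g
l
}'
# prints: \none\ntwo$
```

The `\n` at the start of that output is what the s/^\n// step removes before the de-duplicating substitution runs.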
Upvotes: 0
Reputation: 11547
With sed, you can use
sed -E 's/([a-z]+) +\1/\1/g'
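Note that this only collapses a doubled word on a single line; it does not handle triplicates or duplicates split across lines. Because the pattern has no word boundaries, it can also match inside words (for example, "this is" becomes "this").
This can be fixed by joining all the lines and looping: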
sed -E ':a;N;s/(\b[a-z]+\b)([ \n])[ \n]*\b\1\b */\1\2/g;ba'
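As a rough illustration of why a loop is needed, the sketch below first shows a single pass leaving a triple as a double, then a variant (not the command above) that joins every line before repeating one substitution until nothing changes. It assumes GNU sed for the \< and \> word boundaries; the function name `dedupe`, the labels `join` and `again`, and the sample input are all made up for this example.

```shell
# Single pass with the g flag: the scan resumes after each match,
# so a triple is only reduced to a double.
printf '%s\n' 'is is is a' | sed -E 's/([a-z]+) +\1/\1/g'
# prints: is is a

# Hypothetical helper: read all lines into the pattern space first,
# then apply one substitution repeatedly until it stops matching.
dedupe() {
  sed -E -e ':join' -e '$!{' -e 'N' -e 'b join' -e '}' \
         -e ':again' \
         -e 's/\<([[:alnum:]]+)([[:space:]]+)\1\>[[:space:]]*/\1\2/' \
         -e 't again'
}

printf '%s\n' 'is is is a' 'a test' | dedupe
# prints two lines: "is a" and "test"
```

Keeping \2 in the replacement means the whitespace after the first occurrence survives, which is why the line break between "a" and "test" is preserved here.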
Upvotes: 0
Reputation: 52152
This works with GNU sed, at least for the example input:
$ sed -Ez ':a;s/(\<\S+)(\s+)\1\s+/\1\2/g;ta' infile
This is a text file and is littered with duplicate words
on one or more lines
The -E option is just there to avoid having to escape the capture group parentheses and + quantifiers.
-z treats the input as null byte separated, i.e., as a single line.
The command is then structured as
:a # label
s///g # substitution
ta # jump to label if substitution did something
And the substitution is this:
s/(\<\S+)(\s+)\1\s+/\1\2/g
(\<\S+) – a complete word (start-of-word boundary, then one or more non-space characters)
(\s+) – one or more whitespace characters after that first word
\1\s+ – the first word again plus whatever whitespace follows it
This preserves the whitespace after the first word and discards the whitespace after the duplicate.
Note that -E, -z, \<, \S and \s are all GNU extensions to POSIX sed.
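A small check of the whitespace behaviour described above, assuming GNU sed and a made-up two-line input: the duplicate of "one" sits on the second line, so the newline after the first "one" should be kept and the space after the duplicate dropped.

```shell
# Hypothetical input: "one" ends line 1 and is repeated at the start of line 2.
# With -z the whole input is a single record, so \s+ can match the newline.
printf 'one\none two\n' | sed -Ez ':a;s/(\<\S+)(\s+)\1\s+/\1\2/g;ta'
# prints two lines: "one" and "two"
```

Note that the pattern requires whitespace after the duplicate (\s+), so a repeated word at the very end of input without a trailing newline would be left alone.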
Upvotes: 1