Reputation: 1167

Remove consecutive duplicate words from a file using awk or sed

My input file looks like below:

“true true, rohith Rohith;
cold burn, and fact and fact good good?”

Output should look like:

"true, rohith Rohith;
cold burn, and fact and fact good?"

I am trying to do this with awk, but I have not been able to get the desired result.

awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s ",$i,FS)}{printf("\n")}' input.txt

Could someone please help me here?

Regards, Rohith

Upvotes: 0

Answers (6)

rvbarreto

Reputation: 691

sed -E 's/(\w+) *\1/\1/g' sample.txt

sample.txt

“true true, rohith Rohith;
cold burn, and fact and fact good good?”

output:

:~$ sed -E 's/(\w+) *\1/\1/g' sample.txt
“true, rohith Rohith;
cold burn, and fact and fact good?”

Explanation

(\w+) *\1 - captures a word in group 1, then matches the same word again after zero or more spaces; the pair is replaced by the single captured word.
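
One caveat with this pattern, offered as a hedged sketch rather than a correction of the answer: because ` *` also matches zero spaces and nothing ties the match to word edges, a doubled letter inside a word can match as well (`book` becomes `bok`). Assuming GNU sed, where `\b` and `\w` are available, anchoring both ends avoids that. The function name `dedupe_words` is made up for illustration.

```shell
# Hypothetical helper; assumes GNU sed (\b, \w, and back-references with -E).
dedupe_words() {
    # \b(\w+)     a whole word, starting at a word boundary
    # ( +\1\b)+   one or more repeats of that same word, each after spaces
    sed -E 's/\b(\w+)( +\1\b)+/\1/g' "$@"
}

printf '%s\n' 'true true, a good good book book.' | dedupe_words
# prints: true, a good book.
```

Requiring at least one space (` +`) and a trailing `\b` also means a run such as `true true true` collapses in a single pass.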

Upvotes: 1

Walter A

Reputation: 20002

Simple sed:

echo "true true, rohith Rohith;
cold burn, and fact and fact good good?" | sed -r 's/(\w+) (\1)/\1/g'

Upvotes: 3

KamilCuk

Reputation: 141020

Just match the same backreference in sed:

sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'

How it works:

:l - create a label l to jump to. See tl below.
s - substitute
- /
- \(^\|[^[:alpha:]]\) - match the beginning of the line or a non-alphabetic character. This is so that the next part matches the whole word, not only a suffix.
- \([[:alpha:]]\{1,\}\) - match a word - one or more alphabetic characters.
- [^[:alpha:]]\{1,\} - match a non-word - one or more non-alphabetic characters.
- \2 - match the same thing as in the second \(...\) - i.e. match the word again.
- \($\|[^[:alpha:]]\) - match the end of the line or a non-alphabetic character. This ensures we match the whole second word, not only its prefix.
- /
- \1\2\3 - substitute it for <beginning of the line or non-alphabetic prefix character><the word><end of the line or non-alphabetic suffix character found>
- /
- g - substitute globally. However, because the regex never goes back over text it has already consumed, each pass collapses only two words at a time.
tl - Jump to label l if the last s command was successful. This is here so that when the same word appears 3 times in a row, like true true true, it is properly replaced by a single true.

Without the \(^\|[^[:alpha:]]\) and \($\|[^[:alpha:]]\) groups, true rue, for example, would be replaced by true, because the substring rue rue would match.
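
To see what the `tl` loop and the boundary groups each contribute, the following sketch runs the command above on two small inputs. The function name `dedupe_loop` is only for illustration, and the script is split into three `-e` fragments so the label does not depend on how a given sed parses `;` after `:l`. It still assumes GNU sed for `\|`.

```shell
# Hypothetical wrapper around the command from this answer (GNU sed assumed for \|).
dedupe_loop() {
    sed -e ':l' \
        -e 's/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g' \
        -e 'tl' "$@"
}

# Three repeats: the first pass leaves "true true", the loop finishes the job.
printf '%s\n' 'true true true' | dedupe_loop   # prints: true

# The boundary groups keep a shared suffix from matching.
printf '%s\n' 'true rue' | dedupe_loop         # prints: true rue
```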

Below are my other solutions, which also remove repeated words across lines.

My first solution was with uniq. First I transform the input into pairs of the form <non-alphabetical sequence separating words, encoded in hex> <a word>. Then I run it through uniq -f1, which ignores the first field, and convert back. This will be very slow:

# recreate input
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
# insert zero byte after each word and non-word
# the -z option is from GNU sed
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
# for each pair (non-word, word)
xargs -0 -n2 sh -c '
    # output hexadecimal representation of non-word
    printf "%s" "$1" | xxd -p | tr -d "\n"
    # and output space with the word
    printf " %s\n" "$2"
' -- |
# uniq ignores empty fields - so make sure field1 always has something
sed 's/^/-/' |
# uniq while ignoring first field
uniq -f1 |
# for each pair (non-word in hex, word)
xargs -n2 bash -c '
    # just `printf "%s" "$1" | sed 's/^-//' | xxd -r -p` for posix shell
    # change non-word from hex to characters
    printf "%s" "${1:1}" | xxd -r -p
    # output word
    printf "%s" "$2"
' --

But then I noticed that sed does a good job of tokenizing the input: it places zero bytes between word and non-word tokens, so the stream is easy to read. I can skip repeated words by reading the zero-separated stream in GNU awk and comparing each word with the last word read:

cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
gawk -vRS='\0' '
NR%2==1{
    nonword=$0
}
NR%2==0{
    if (length(lastword) && lastword != $0) {
        printf "%s%s", lastword, nonword
    }
    lastword=$0
}
END{
    printf "%s%s", lastword, nonword
}'

Instead of the zero byte, some other unique character could be used as the record separator, for example ^. That way it works with non-GNU awk too; I tested it with the mawk available on repl. I also shortened the script by using shorter variable names:

cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r 's/[[:alpha:]]+/^&^/g' |
awk -vRS='^' '
    NR%2{ n=$0 }
    NR%2-1 && length(l) && l != $0 { printf "%s%s", l, n }
    NR%2-1 { l=$0 }
    END { printf "%s%s", l, n }
'

Tested on repl. The snippets output:

true, rohith Rohith;
cold burn, and fact and fact good?

Upvotes: 3

Ed Morton

Reputation: 203502

With GNU awk for the 4th arg to split():

$ cat tst.awk
{
    n = split($0,words,/[^[:alpha:]]+/,seps)
    prev = ""
    for (i=1; i<=n; i++) {
        word = words[i]
        if (word != prev) {
            printf "%s%s", seps[i-1], word
        }
        prev = word
    }
    print ""
}

$ awk -f tst.awk file
“true, rohith Rohith;
cold burn, and fact and fact good?”
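
For awks without the 4-argument split(), a similar loop can be sketched with match(), which is available in POSIX awk. This is an alternative written for illustration, not the answer's own method: the function name `dedupe_awk` is hypothetical, and the pattern `[A-Za-z]+` assumes ASCII letters only (it is used instead of `[[:alpha:]]` because some older awks do not support bracket classes).

```shell
# Hypothetical portable variant: tokenizes with match() instead of gawk's 4-arg split().
dedupe_awk() {
    awk '{
        line = $0; out = ""; prev = ""
        # Peel off one word at a time; match() sets RSTART and RLENGTH.
        while (match(line, /[A-Za-z]+/)) {
            gap  = substr(line, 1, RSTART - 1)      # separator before the word
            word = substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)   # rest of the line
            if (word != prev) out = out gap word    # keep a new word and its separator
            prev = word
        }
        print out line                              # append trailing non-word text
    }' "$@"
}

printf '%s\n' 'cold burn, and fact and fact good good?' | dedupe_awk
# prints: cold burn, and fact and fact good?
```

As in the gawk version, a repeated word is dropped together with the separator in front of it, so the punctuation after the last copy stays attached to the word that is kept.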

Upvotes: 5

Marcelo Castro

Reputation: 21

Depending on your expected input, this might work:

sed -r 's/([a-zA-Z0-9_-]+)( *)\1/\1\2/g ; s/ ([.,;:])/\1/g ; s/  / /g' myfile

([a-zA-Z0-9_-]+) = words that might be repeated.

( *)\1 = checks whether the captured word is repeated after zero or more spaces.

s/ ([.,;:])/\1/g = removes extra spaces before punctuation (you might want to add characters to this group).

s/  / /g = removes double spaces.

This works with GNU sed.
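
To see what the cleanup steps contribute, the three substitutions can be run one after another as separate stages. This is only an illustrative sketch: the helper names `step1`, `step2`, and `step3` are made up, and GNU sed is assumed for `-r`.

```shell
# Hypothetical helper names; the sed scripts are the three steps from this answer.
step1() { sed -r 's/([a-zA-Z0-9_-]+)( *)\1/\1\2/g'; }   # drop the repeated word, keep the spaces
step2() { sed -r 's/ ([.,;:])/\1/g'; }                  # remove a space before listed punctuation
step3() { sed -r 's/  / /g'; }                          # collapse double spaces

printf '%s\n' 'true true, rohith Rohith;' | step1                  # prints: true , rohith Rohith;
printf '%s\n' 'true true, rohith Rohith;' | step1 | step2 | step3  # prints: true, rohith Rohith;
printf '%s\n' 'good good?' | step1 | step2 | step3                 # prints: good ?
```

The last line shows why the punctuation group may need more characters: `?` is not in `[.,;:]`, so the space in front of it is left behind.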

Upvotes: 0

anubhava

Reputation: 785146

This is not exactly what you have shown in output but is close using gnu-awk:

awk -v RS='[^-_[:alnum:]]+' '$1 == p{printf "%s", RT; next} {p=$1; ORS=RT} 1' file

“true , rohith Rohith;
cold burn, and fact and fact good ?”

Upvotes: 1
