Reputation: 375

Remove partial duplicates from text file

My bash-fu is a little rusty right now, so I wanted to see if there's a clever way to remove partial duplicates from a file. I have a bunch of files containing thousands of lines in the following format:

String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x

Essentially it's a bunch of pipe-delimited strings, with the final two columns being a timestamp and x. What I'd like to do is concatenate all of my files and then remove all partial duplicates. I'm defining a partial duplicate as a line that matches another line from String1 through String22, even if the timestamp differs.

For example, a file containing:

String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 12:12:12|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x

would become:

String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x

(It doesn't matter which timestamp is chosen).

Any ideas?

Upvotes: 1

Answers (3)

karakfa

Reputation: 67567

Same idea as @anubhava's answer, but I think more idiomatic:

$ awk -F'|' '{line=$0;$NF=$(NF-1)=""} !a[$0]++{print line}' file

String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
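One caveat with the command above: assigning to a field makes awk rebuild $0 with the default output separator (a single space), so two different prefixes could in principle produce the same key if the strings themselves contain spaces. A minimal sketch of a variant that sets OFS to the pipe as well, assuming made-up sample lines chosen to show the difference:

```shell
# Hypothetical input: the two lines differ only in where the pipe falls,
# so with a space as OFS both would rebuild to the same key.
input=$(printf '%s\n' 'a b|c|09-Apr-2016 05:28:03|x' 'a|b c|09-Apr-2016 12:12:12|x')

# Setting OFS to "|" keeps the rebuilt $0 pipe-delimited, so the keys stay distinct.
kept=$(printf '%s\n' "$input" |
  awk 'BEGIN{FS=OFS="|"} {line=$0; $NF=$(NF-1)=""} !a[$0]++{print line}')
printf '%s\n' "$kept"
```

Both lines are kept here because their first two fields really are different. If the strings never contain spaces, the original one-liner behaves the same way.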

Upvotes: 0

pjh

Reputation: 8209

If you have Bash 4 or later, which supports associative arrays, it can be done fairly efficiently in pure Bash (with the input path in $file):

declare -A found
while IFS= read -r line || [[ -n $line ]] ; do
    strings=${line%|*|*}
    if (( ! ${found[$strings]-0} )) ; then
        printf '%s\n' "$line"
        found[$strings]=1
    fi
done < "$file"
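The key step in the loop is the parameter expansion ${line%|*|*}, which removes the shortest suffix matching the pattern |*|*, that is, the last two pipe-delimited fields. A small sketch on a shortened, made-up line (two strings instead of twenty-two):

```shell
# Hypothetical shortened line: two strings, then the timestamp and the x.
line='String1|String2|09-Apr-2016 05:28:03|x'

# % strips the shortest suffix matching |*|* (the timestamp and x columns).
key=${line%|*|*}
printf '%s\n' "$key"   # prints String1|String2
```

Everything left in key is what the loop uses as the associative array subscript, so lines that differ only in the timestamp map to the same entry.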

Upvotes: 0

anubhava

Reputation: 786289

Using awk you can do this:

awk '{k=$0; gsub(/(\|[^|]*){2}$/, "", k)} !seen[k]++' file

String1|String2|String3|String4|String5|String6|String7|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|String7|09-Apr-2016 05:28:03|x

The awk command first builds a variable k by removing the last two fields from each line. It then uses k as the key into an associative array, seen, and prints a line only the first time its key appears; every processed key is recorded in the array.
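Since the question asks about several files, here is a rough sketch of the same technique applied to more than one input at once. The file names and contents are made up for illustration, and the regex spells out the two trailing fields instead of using the {2} interval, which some older awk implementations do not support:

```shell
# Hypothetical sample files standing in for the real inputs.
tmpdir=$(mktemp -d)
printf '%s\n' 'A|B|09-Apr-2016 05:28:03|x' 'C|B|09-Apr-2016 05:28:03|x' > "$tmpdir/part1.txt"
printf '%s\n' 'A|B|09-Apr-2016 12:12:12|x' > "$tmpdir/part2.txt"

# awk reads the files in the order given, so no separate cat step is needed.
# sub() strips the last two pipe-delimited fields from the copy k;
# the untouched line ($0) is printed only the first time k is seen.
deduped=$(awk '{k=$0; sub(/\|[^|]*\|[^|]*$/, "", k)} !seen[k]++' \
  "$tmpdir/part1.txt" "$tmpdir/part2.txt")
printf '%s\n' "$deduped"
rm -rf "$tmpdir"
```

The line from part2.txt is dropped because its key, A|B, was already seen in part1.txt; the timestamp that survives is whichever one appears first in the input order.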

Upvotes: 3
