Reputation: 375
My bash-fu is a little rusty right now, so I wanted to see if there's a clever way to remove partial duplicates from a file. I have a bunch of files containing thousands of lines in the following format:
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
Essentially it's a bunch of pipe-delimited strings, with the final two columns being a timestamp and x. What I'd like to do is concatenate all of my files and then remove all partial duplicates. I define a partial duplicate as a line that matches another line from String1 up to String22, even if the timestamp differs.
For example, a file containing:
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 12:12:12|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
would become:
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
(It doesn't matter which timestamp is chosen).
Any ideas?
Upvotes: 1
Views: 235
Reputation: 67467
Same idea as @anubhava's, but I think more idiomatic:
$ awk -F'|' '{line=$0;$NF=$(NF-1)=""} !a[$0]++{print line}' file
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
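Since the question also asks about concatenating several files first: awk accepts multiple file arguments and keeps the array across them, so a separate cat step should not be needed. Below is a minimal sketch of that, wrapped in a function. The name dedupe_partial and the sample data are made up for illustration, and OFS is set to '|' as an extra precaution so that the rebuilt $0 used as the key stays pipe-delimited even when fields contain spaces.

```shell
# dedupe_partial: hypothetical helper name. Prints the first line seen for each
# distinct set of leading fields, ignoring the last two (timestamp and flag).
dedupe_partial() {
    # With no file arguments awk reads stdin; with several it reads them in
    # order, and the array a persists across files.
    awk -F'|' -v OFS='|' '{line=$0; $NF=$(NF-1)=""} !a[$0]++{print line}' "$@"
}

# Demo on inline sample data (stand-in for the real files).
printf '%s\n' \
    'A|B|09-Apr-2016 05:28:03|x' \
    'A|B|09-Apr-2016 12:12:12|x' \
    'C|B|09-Apr-2016 05:28:03|x' | dedupe_partial
```

With real inputs this would be called as dedupe_partial followed by the file names, with the output redirected to a new file.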
Upvotes: 0
Reputation: 8064
If you have Bash version 4, which supports associative arrays, it can be done fairly efficiently in pure Bash:
declare -A found
while IFS= read -r line || [[ -n $line ]] ; do
    strings=${line%|*|*}
    if (( ! ${found[$strings]-0} )) ; then
        printf '%s\n' "$line"
        found[$strings]=1
    fi
done < "$file"
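The key step in the loop above is the ${line%|*|*} expansion, which strips the shortest suffix matching the glob |*|*, that is, the last two pipe-delimited fields. A small sketch on a made-up sample line shows it in isolation:

```shell
# Made-up sample line with two leading fields, a timestamp and the trailing x.
line='String1|String2|09-Apr-2016 05:28:03|x'

# %pattern removes the shortest suffix matching the glob |*|*,
# which here is the timestamp field plus the final field.
key=${line%|*|*}

printf '%s\n' "$key"    # prints String1|String2
```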
Upvotes: 0
Reputation: 784968
Using awk you can do this:
awk '{k=$0; gsub(/(\|[^|]*){2}$/, "", k)} !seen[k]++' file
String1|String2|String3|String4|String5|String6|String7|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|String7|09-Apr-2016 05:28:03|x
The awk command first builds a variable k by removing the last 2 fields from each line. It then uses an associative array seen, keyed by k, and prints a line only the first time its key appears, recording every processed key in the array.
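For readers who find the one-liner dense, here is a sketch of the same logic written out step by step. The function name dedupe_expanded is made up for illustration. It also spells out the {2} interval as two explicit repetitions, on the assumption that some awk implementations (older mawk builds, for example) may not support interval expressions.

```shell
# dedupe_expanded: hypothetical name; a long-form version of the same idea.
dedupe_expanded() {
    awk '{
        k = $0                          # work on a copy so $0 stays intact
        sub(/\|[^|]*\|[^|]*$/, "", k)   # strip the last two pipe-delimited fields
        if (!(k in seen))               # first time this key shows up
            print                       # print the original, unmodified line
        seen[k] = 1                     # remember the key for later lines
    }' "$@"
}

# Demo on inline sample data.
printf '%s\n' \
    'String1|String2|09-Apr-2016 05:28:03|x' \
    'String1|String2|09-Apr-2016 12:12:12|x' \
    'String124|String2|09-Apr-2016 05:28:03|x' | dedupe_expanded
```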
Upvotes: 3