Reputation: 279

Removing all duplicate entries in a field

I have a file in the following format:

text   number   number   A;A;A;A;A;A
text   number   number   B
text   number   number   C;C;C;C;D;C;C;C;C

What I want to do is remove all repeats of the entries in the fourth column to end up with this:

text   number   number   A
text   number   number   B
text   number   number   C;D

I'd prefer a bash solution that fits into a pipe with the other text manipulation I'm doing on this file.

Thanks!

Upvotes: 4

Answers (4)

Thor

Reputation: 47099

Assuming tab-delimited input, you could do it like this with GNU parallel:

parallel -C '\t' c4='$(echo {4} | tr ";" "\n" | sort -u | head -c-1 | tr "\n" ";");' \
                 echo -e '"{1}\t{2}\t{3}\t$c4"' :::: infile

Output:

text    number  number  A
text    number  number  B
text    number  number  C;D

Upvotes: 2

doubleDown

Reputation: 8398

This might work too:

awk -F";" '{
              delete words
              match($1,/[[:alpha:]]$/)
              words[substr($1,RSTART,RLENGTH)]++
              printf "%s",$1
              for (i=2;i<=NF;i++){
                if (!words[$i]++) printf ";%s",$i
              }
              printf "\n"
           }' file

Notes:

Since ; is used as the field separator, it doesn't matter how many columns precede A;A;A;A;A;A, or which delimiters separate them.
/[[:alpha:]]$/ can be replaced with /[^[:space:]]+$/ to match a run of non-space characters instead of a single letter.
if (!words[$i]++) printf ";%s",$i prints the item only if it is not yet a key of the associative array words, i.e. if words[$i] is still 0.
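The !words[$i]++ idiom from the notes above can be tried in isolation. The following is only a minimal sketch, not part of the original answer: dedupe_items is a hypothetical helper name, and it assumes the input lines contain nothing but the ;-separated list.

```shell
# Hypothetical helper: print each ';'-separated item once per line,
# in first-seen order.
dedupe_items() {
  awk -F';' '{
    out = ""
    split("", seen)            # portable way to empty the array for each line
    for (i = 1; i <= NF; i++)
      if (!seen[$i]++)         # true only the first time $i is encountered
        out = out (out == "" ? "" : ";") $i
    print out
  }'
}

printf 'C;C;C;C;D;C;C;C;C\n' | dedupe_items
```

For the sample input this prints C;D. The post-increment is what makes it work: the test sees the old count (0 on first sight), and the count is bumped afterwards so later repeats are skipped.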

Upvotes: 1

potong

Reputation: 58371

This might work for you (GNU sed):

sed 's/.*\s/&\n/;h;s/.*\n//;:a;s/\(\([^;]\).*\);\2/\1/;ta;H;g;s/\n.*\n//' file

Upvotes: 2

iruvar

Reputation: 23374

You can achieve this using awk. First split field 4 into an array on ;:

awk '{delete z; d=""; split($4,arr,";");for (k in arr) z[arr[k]]=k; for (l in z) d=d";"l; print($1,$2,$3,substr(d, 2))}' file_name
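One caveat with the one-liner above: awk does not guarantee the traversal order of for (k in arr), so the surviving items may come out in a different order than they appeared. A possible variant that walks the array by index is sketched below. It is an illustration only: dedupe_col4 is a hypothetical name, and tab-delimited input is an assumption (the question does not say which delimiter the file uses).

```shell
# Hypothetical filter: dedupe the ';'-separated items in column 4.
# Assumes tab-delimited input; reads stdin so it can sit in a pipe.
dedupe_col4() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    n = split($4, items, ";")  # items[1..n] keep the original order
    split("", seen)            # empty the lookup table for each line
    out = ""
    for (k = 1; k <= n; k++)
      if (!(items[k] in seen)) {
        seen[items[k]] = 1
        out = out (out == "" ? "" : ";") items[k]
      }
    $4 = out                   # rebuilds the line using OFS (tab)
    print
  }'
}

printf 'text\t1\t2\tC;C;C;C;D;C;C;C;C\n' | dedupe_col4
```

Because the function reads standard input and writes standard output, it can be dropped between other commands, e.g. some_command | dedupe_col4 | another_command.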

Upvotes: 3
