Reputation: 279
I have a file in the following format:
text number number A;A;A;A;A;A
text number number B
text number number C;C;C;C;D;C;C;C;C
What I want to do is remove all repeats of the entries in the fourth column to end up with this:
text number number A
text number number B
text number number C;D
I'd prefer a bash solution that fits into a pipe with the other text manipulation I'm doing on this file.
Thanks!
Upvotes: 4
Views: 165
Reputation: 47099
Assuming tab-delimited input, you could do it like this with GNU parallel:
parallel -C '\t' c4='$(echo {4} | tr ";" "\n" | sort -u | head -c-1 | tr "\n" ";");' \
echo -e '"{1}\t{2}\t{3}\t$c4"' :::: infile
Output:
text number number A
text number number B
text number number C;D
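The core of this pipeline can be tried on a single field value. Below is a minimal sketch, assuming a POSIX shell; the function name dedup_field is hypothetical, and paste -sd ';' is swapped in for the head -c-1 | tr step to join the lines back together:

```shell
# dedup_field: hypothetical helper. Reads one ';'-separated value on stdin
# and prints its unique entries, sorted, joined with ';'.
dedup_field() {
    # one entry per line -> sort and drop duplicates -> rejoin with ';'
    tr ';' '\n' | sort -u | paste -sd ';' -
}

echo 'C;C;C;C;D;C;C;C;C' | dedup_field
```

Because sort -u sorts the entries, they come out in sorted order rather than in order of first appearance.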
Upvotes: 2
Reputation: 8398
This might work too:
awk -F";" '{
    delete words
    match($1,/[[:alpha:]]$/)
    words[substr($1,RSTART,RLENGTH)]++
    printf "%s",$1
    for (i=2;i<=NF;i++){
        if (!words[$i]++) printf ";%s",$i
    }
    printf "\n"
}' file
Notes:
Since ; is used as the field separator, it doesn't matter how many columns precede A;A;A;A;A;A, or which delimiters separate them.
/[[:alpha:]]$/ can be replaced with /[^[:space:]]+$/ to match a run of non-space characters instead of a single letter.
if (!words[$i]++) printf ";%s",$i prints the entry only if it is not yet a key in the associative array words, i.e. if words[$i] is still 0 (unset).
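The !words[$i]++ test is a general first-occurrence filter in awk. Below is a minimal sketch of the idiom on its own, with one entry per line; the function name first_seen and the array name seen are illustrative only:

```shell
# first_seen: hypothetical helper. Prints each input line only the first
# time it appears, preserving input order.
first_seen() {
    # seen[$0]++ evaluates to the count before the increment (0 the first
    # time), so !seen[$0]++ is true only for the first occurrence of a line.
    awk '!seen[$0]++'
}

printf '%s\n' C C D C | first_seen
```

This prints C and then D, each on its own line.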
Upvotes: 1
Reputation: 58371
This might work for you (GNU sed):
sed 's/.*\s/&\n/;h;s/.*\n//;:a;s/\(\([^;]\).*\);\2/\1/;ta;H;g;s/\n.*\n//' file
Upvotes: 2
Reputation: 23374
You can achieve this using awk. Split field 4 into an array using ; first:
awk '{delete z; d=""; split($4,arr,";");for (k in arr) z[arr[k]]=k; for (l in z) d=d";"l; print($1,$2,$3,substr(d, 2))}' file_name
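Note that for (k in arr) visits keys in an unspecified order in awk, so the one-liner above may emit the entries in a different order than they appeared, and print with commas joins the fields with spaces by default. Below is a sketch of an order-preserving variant, assuming tab-delimited input (the question does not state the delimiter); the function name dedup_col4 is hypothetical:

```shell
# dedup_col4: hypothetical helper. Reads tab-delimited lines on stdin and
# removes repeated ';'-separated entries from the fourth column, keeping
# the order of first appearance.
dedup_col4() {
    awk 'BEGIN { FS = OFS = "\t" }   # assumption: tab-delimited columns
    {
        n = split($4, parts, ";")    # parts[1..n] in original order
        split("", seen)              # portable way to clear the array
        out = ""; sep = ""
        for (i = 1; i <= n; i++) {
            if (!seen[parts[i]]++) { # true only for the first occurrence
                out = out sep parts[i]
                sep = ";"
            }
        }
        $4 = out                     # rebuild the record with OFS
        print
    }'
}

printf 'text\t1\t2\tC;C;C;C;D;C;C;C;C\n' | dedup_col4
```

Since it reads stdin and writes stdout, it can sit in the middle of a pipe, e.g. some_command | dedup_col4 | next_command.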
Upvotes: 3