user14748664

Reputation:

bash/awk: remove duplicate columns after merging several files

I am using the following function in my bash script to merge many files (containing multi-column data) into one big summary chart with all of the fused data:

table_fuse () {
    # align the pasted columns with column -t before appending to the summary file
    paste -d$'\t' "${rescore}"/*.csv | column -t -s$'\t' >> "${rescore}"/results_2PROTS_CNE_strategy3.csv
}

Taking two files as an example, this routine produces the following merged chart:

# file 1                    # file 2
Lig dG(10V1) dG(rmsd)   Lig dG(10V2) dG(rmsd)
lig1 -6.78 0.32 lig1 -7.04 0.20
lig2 -5.56 0.14 lig2 -5.79 0.45
lig3 -7.30 0.78 lig3 -7.28 0.71
lig4 -7.98 0.44 lig4 -7.87 0.42
lig5 -6.78 0.28 lig5 -6.75 0.31
lig6 -6.24 0.24 lig6 -6.24 0.24
lig7 -7.44 0.40 lig7 -7.42 0.39
lig8 -4.62 0.41 lig8 -5.19 0.11
lig9 -7.26 0.16 lig9 -7.30 0.13

Since both files share the same first column (Lig), how would it be possible to remove (replace with " ") all repeats of this column in each of the fused files, while keeping only the Lig column from the first CSV?

Upvotes: 2

Views: 425

Answers (3)

RavinderSingh13

Reputation: 133650

EDIT: As per OP's comments, to cover the [Ll]ig, [Ll]ig0123, or [Ll]ig(abcd) formats in the file, adding the following solution here.

awk '{first=$1;gsub(/[Ll]ig([0-9]+)?(\([a-zA-Z]+\))?/,"");print first,$0}' Input_file
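
For instance, a quick check of the extended pattern against a made-up lig(abcd) row:

echo 'lig(abcd) -6.78 0.32 lig(abcd) -7.04 0.20' | awk '{first=$1;gsub(/[Ll]ig([0-9]+)?(\([a-zA-Z]+\))?/,"");print first,$0}'
lig(abcd)  -6.78 0.32  -7.04 0.20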


With awk you could try the following, considering that you want to remove only lig(digits) duplicate values here.

awk '{first=$1;gsub(/[Ll]ig([0-9]+)?/,"");print first,$0}' Input_file

Explanation: a detailed explanation of the above.

awk '                   ##Starting awk program from here.
{
  first=$1              ##Saving the first column value in variable first.
  gsub(/[Ll]ig([0-9]+)?/,"")  ##Globally substituting L/lig plus optional digits with NULL in the whole line.
  print first,$0        ##Printing first and the current line here.
}
' Input_file            ##Mentioning Input_file name here.
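
For instance, a quick check against one merged row from the question's sample:

echo 'lig1 -6.78 0.32 lig1 -7.04 0.20' | awk '{first=$1;gsub(/[Ll]ig([0-9]+)?/,"");print first,$0}'
lig1  -6.78 0.32  -7.04 0.20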

Upvotes: 3

markp-fuso

Reputation: 35046

join would appear to be a solution, at least for the sample data provided by OP ...

Sample input data:

$ cat file1
Lig dG(10V1) dG(rmsd)
lig1 -6.78 0.32
lig2 -5.56 0.14
lig3 -7.30 0.78
lig4 -7.98 0.44
lig5 -6.78 0.28
lig6 -6.24 0.24
lig7 -7.44 0.40
lig8 -4.62 0.41
lig9 -7.26 0.16

$ cat file2
Lig dG(10V2) dG(rmsd)
lig1 -7.04 0.20
lig2 -5.79 0.45
lig3 -7.28 0.71
lig4 -7.87 0.42
lig5 -6.75 0.31
lig6 -6.24 0.24
lig7 -7.42 0.39
lig8 -5.19 0.11
lig9 -7.30 0.13

We can join these two files on the first column (aka field) like so:

$ join -j1 file1 file2
Lig dG(10V1) dG(rmsd) dG(10V2) dG(rmsd)
lig1 -6.78 0.32 -7.04 0.20
lig2 -5.56 0.14 -5.79 0.45
lig3 -7.30 0.78 -7.28 0.71
lig4 -7.98 0.44 -7.87 0.42
lig5 -6.78 0.28 -6.75 0.31
lig6 -6.24 0.24 -6.24 0.24
lig7 -7.44 0.40 -7.42 0.39
lig8 -4.62 0.41 -5.19 0.11
lig9 -7.26 0.16 -7.30 0.13

For more than 2 files, some sort of repetitive/looping method would be needed to repeatedly join each new file into the mix (keeping in mind that join expects its inputs sorted on the join field).
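
A minimal sketch of such a loop (assuming three files named file1, file2, file3, all already sorted on column 1; the accumulator name merged.out is made up for illustration):

# seed the accumulator with the first file
cp file1 merged.out
# fold each remaining file in, one join at a time
for f in file2 file3; do
    join -j1 merged.out "$f" > merged.tmp && mv merged.tmp merged.out
done
column -t merged.out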

Upvotes: 0

tripleee

Reputation: 189679

It's not hard to replace repeats of phrases. What exactly works for your case depends on the precise input file format, but something like

sed 's/^\([^ ]*\)\( .* \)\1 /\1\2 /' file

would get rid of a repeat of the token in the first column (one repeat per pass; with two merged files there is only one).
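
For instance, against one merged row from the question's sample (assuming single-space separators as shown):

echo 'lig1 -6.78 0.32 lig1 -7.04 0.20' | sed 's/^\([^ ]*\)\( .* \)\1 /\1\2 /'
lig1 -6.78 0.32  -7.04 0.20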

Perhaps a better solution is to use a more sophisticated merge tool, though. A simple Awk or Python script could take care of removing the first token from every file except the first while merging.
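
A minimal Awk sketch along those lines (assumptions: whitespace-separated input and every file listing its ligands in the same order; the file names are placeholders):

awk '
FNR == 1 { filenum++ }    # FNR resets at the start of each input file
filenum == 1 {            # keep the first file intact
  row[FNR] = $0; nrows = FNR; next
}
{                         # later files: drop the leading Lig token, append the rest
  sub(/^[^[:space:]]+[[:space:]]+/, "")
  row[FNR] = row[FNR] " " $0
}
END { for (i = 1; i <= nrows; i++) print row[i] }
' file1 file2 file3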

Upvotes: 1
