Reputation:
I am using the following function written in my bash script in order to merge many files (contained multi-column data) into one big summary chart with all fused data
table_fuse () {
paste -d'\t' "${rescore}"/*.csv >> "${rescore}"/results_2PROTS_CNE_strategy3.csv | column -t -s$'\t'
}
Taking two files as an example, this routine would produce the following concatenated chart as the result of the merging:
# file 1. # file 2
Lig dG(10V1) dG(rmsd) Lig dG(10V2) dG(rmsd)
lig1 -6.78 0.32 lig1 -7.04 0.20
lig2 -5.56 0.14 lig2 -5.79 0.45
lig3 -7.30 0.78 lig3 -7.28 0.71
lig4 -7.98 0.44 lig4 -7.87 0.42
lig5 -6.78 0.28 lig5 -6.75 0.31
lig6 -6.24 0.24 lig6 -6.24 0.24
lig7 -7.44 0.40 lig7 -7.42 0.39
lig8 -4.62 0.41 lig8 -5.19 0.11
lig9 -7.26 0.16 lig9 -7.30 0.13
Since the both files share the same first column (Lig), how would it be possible to remove (substitute to " ") all repeats of this column in each of the fussed file, while keeping only the Lig column from the first CSV?
Upvotes: 2
Views: 425
Reputation: 133650
EDIT: As per OP's comments to cover [Ll]ig
or [Ll]ig0123
or [Ll]ig(abcd)
formats in file adding following solution here.
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?(\([-azA-Z]+\))?/,"");print first,$0}' Input_file
With awk
you could try following, considering that you want to remove only lig(digits)
duplicate values here.
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?/,"");print first,$0}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
first=$1 ##Setting first column value to first here.
gsub(/[Ll]ig([0-9]+)?/,"") ##Globally substituting L/lig digits(optional) with NULL in whole line.
print first,$0 ##printing first and current line here.
}
' Input_file ##mentioning Input_file name here.
Upvotes: 3
Reputation: 35046
join
would appear to be a solution, at least for the sample data provided by OP ...
Sample input data:
$ cat file1
Lig dG(10V1) dG(rmsd)
lig1 -6.78 0.32
lig2 -5.56 0.14
lig3 -7.30 0.78
lig4 -7.98 0.44
lig5 -6.78 0.28
lig6 -6.24 0.24
lig7 -7.44 0.40
lig8 -4.62 0.41
lig9 -7.26 0.16
$ cat file2
Lig dG(10V2) dG(rmsd)
lig1 -7.04 0.20
lig2 -5.79 0.45
lig3 -7.28 0.71
lig4 -7.87 0.42
lig5 -6.75 0.31
lig6 -6.24 0.24
lig7 -7.42 0.39
lig8 -5.19 0.11
lig9 -7.30 0.13
We can join
these two files on the first column (aka field) like such:
$ join -j1 file1 file2
Lig dG(10V1) dG(rmsd) dG(10V2) dG(rmsd)
lig1 -6.78 0.32 -7.04 0.20
lig2 -5.56 0.14 -5.79 0.45
lig3 -7.30 0.78 -7.28 0.71
lig4 -7.98 0.44 -7.87 0.42
lig5 -6.78 0.28 -6.75 0.31
lig6 -6.24 0.24 -6.24 0.24
lig7 -7.44 0.40 -7.42 0.39
lig8 -4.62 0.41 -5.19 0.11
lig9 -7.26 0.16 -7.30 0.13
For more than 2 files some sort of repetitive/looping method would be needed to repeatedly join a new file into the mix.
Upvotes: 0
Reputation: 189679
It's not hard to replace repeats of phrases. What exactly works for your case depends on the precise input file format; but something like
sed s/^\([^ ]*\)\( .* \)\1 /\1\2 /' file
would get rid of any repeat of the token in the first column.
Perhaps a better solution is to use a more sophisticated merge tool, though. A simple Awk or Python script could take care of removing the first token from every file except the first while merging.
Upvotes: 1