Reputation:
Dealing with multi-line CSV file, I am looking for a possible Bash shell workflow that could be useful for its treatment. Here is format of the file containing data in multi-column format:
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1000.dlg: 6 | -4.86 | 2 | -4.79 | 4 |####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1001.dlg: 2 | -5.25 | 10 | -5.22 | 8 |########
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1002.dlg: 5 | -5.76 | 6 | -5.48 | 3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1003.dlg: 4 | -3.88 | 17 | -3.50 | 3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1009.dlg: 5 | -4.51 | 5 | -4.39 | 4 |####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_100.dlg: 3 | -4.40 | 11 | -4.38 | 9 |#########
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1010.dlg: 1 | -5.07 | 15 | -4.51 | 5 |#####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_150.dlg: 4 | -5.01 | 5 | -4.82 | 3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_156.dlg: 2 | -5.38 | 11 | -4.70 | 3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_157.dlg: 1 | -4.22 | 10 | -4.16 | 7 |#######
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_167.dlg: 2 | -3.85 | 3 | -3.69 | 9 |#########
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_168.dlg: 2 | -4.42 | 12 | -4.01 | 6 |######
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_169.dlg: 2 | -4.94 | 17 | -4.80 | 5 |#####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_16.dlg: 1 | -6.23 | 4 | -5.77 | 4 |###
According to the format: all the columns with valuable information are divided by |
with the exception of the first column (name of the line), divided by :
from the rest. The script should operate with following post-processing:
#
), discarding all of the lines containing #
, ##
or ###
. Alternatively this filter can be applied on the penultimate column, which expresses the number of #
characters as a number.While I can do the first task using sort
sort -t '|' -k 3 filename.csv
and the second may be achieved using AWK
awk '(NR>1) && ($8 > 2) ' filename.csv > filename_processed.txt
how could I combine the both commands in efficient fashion taking into account the format of my file?
Upvotes: 2
Views: 127
Reputation: 133620
Could you please try following, written and tested in shown samples in GNU awk
.
awk '
BEGIN{
FS=OFS="|"
}
gsub(/#/,"&",$6)>4
' Input_file | sort -t'|' -nk 3 > output_file
EDIT: As per OP's comment to get last 10% lines from starting of Input_file you could following, take above command's output into a output file and could run following.
awk -v lines="$(wc -l < output_file)" '
BEGIN{
tenPer=int(lines/10)
}
FNR>(tenPer){exit}
1
' output_file
For getting 10% last lines of output_file try:
tac output_file |
awk -v lines="$(wc -l < output_file)" 'BEGIN{tenPer=int(lines/10)} FNR>tenPer{exit} 1' |
tac
OR
awk -v lines="$(wc -l < output_file)" 'BEGIN{tenPer=int(lines/10)} FNR>=(lines-tenPer)' output_file
Upvotes: 3
Reputation: 5975
It is better to sort at the end with fewer lines.
grep -E "#{4}$" file | sort -t"|" -nk3
If you need to filter for different number of #
modify the number in the expression of grep
. If you need reversed sorting add the r
parameter to the sort
command. If you need sorting per different column, modify the k
argument.
Upvotes: 3
Reputation: 189658
If your commands are really all you need, trivially
awk '(NR>1) && ($8 > 2) ' filename.csv |
sort -t '|' -k 3 filename.csv > filename_processed.txt
Upvotes: 1
Reputation: 122
You can try:
sort -nr -k 4 scratch.scv | grep -v -E "[^#]#{1,3}$"
Sort base on column value and eject the line with 1-3 number of #.
Upvotes: 3