user4007276

shell: treatment of the multi-line format according to its column patterns

I am working with a multi-line CSV file and am looking for a Bash shell workflow to process it. Here is the format of the file, which contains multi-column data:

/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1000.dlg:   6 |     -4.86 |   2 |     -4.79 |   4 |####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1001.dlg:   2 |     -5.25 |  10 |     -5.22 |   8 |########
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1002.dlg:   5 |     -5.76 |   6 |     -5.48 |   3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1003.dlg:   4 |     -3.88 |  17 |     -3.50 |   3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1009.dlg:   5 |     -4.51 |   5 |     -4.39 |   4 |####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_100.dlg:   3 |     -4.40 |  11 |     -4.38 |   9 |#########
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1010.dlg:   1 |     -5.07 |  15 |     -4.51 |   5 |#####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_150.dlg:   4 |     -5.01 |   5 |     -4.82 |   3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_156.dlg:   2 |     -5.38 |  11 |     -4.70 |   3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_157.dlg:   1 |     -4.22 |  10 |     -4.16 |   7 |#######
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_167.dlg:   2 |     -3.85 |   3 |     -3.69 |   9 |#########
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_168.dlg:   2 |     -4.42 |  12 |     -4.01 |   6 |######
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_169.dlg:   2 |     -4.94 |  17 |     -4.80 |   5 |#####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_16.dlg:   1 |     -6.23 |   4 |     -5.77 |   4 |###

In this format, the data columns are separated by |, except for the first column (the name of the line), which is separated from the rest by :. The script should perform the following post-processing:

  1. Sort all lines by the value in the third column (from the most negative to the most positive values);
  2. Filter on the last column (by the number of # characters), discarding all lines that contain only #, ## or ###. Alternatively, this filter can be applied to the penultimate column, which gives the number of # characters as a number.
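
A quick way to check how these delimiters map onto column numbers is to split one record on both : and |. This is only a minimal sketch: show_fields is a hypothetical helper name, and it assumes the path itself contains no : or | characters.

```shell
#!/usr/bin/env bash
# show_fields FILE
# Hypothetical helper: prints every field of the first record as index=value,
# splitting on ':' and '|' and trimming the surrounding blanks.
show_fields() {
  awk -F'[:|]' 'NR == 1 {
    for (i = 1; i <= NF; i++) {
      gsub(/^[ \t]+|[ \t]+$/, "", $i)   # strip leading/trailing blanks
      printf "%d=%s\n", i, $i
    }
  }' "$1"
}
```

Under this split the first energy value is column 3 and the numeric # count is column 6. If only | is used as the delimiter (as in sort -t '|'), the name and the first number share field 1, so every later column shifts down by one.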

While I can do the first task using sort

sort -t '|' -k 3 filename.csv

and the second may be achieved using AWK

awk '(NR>1) && ($8 > 2) ' filename.csv > filename_processed.txt

how could I combine both commands efficiently, taking into account the format of my file?

Upvotes: 2

Views: 127

Answers (4)

RavinderSingh13

Reputation: 133620

Could you please try the following, written and tested with the shown samples in GNU awk.

awk '
BEGIN{
  FS=OFS="|"
}
# gsub returns the number of # in the last field; keep lines with more than 3
gsub(/#/,"&",$6)>3
' Input_file | sort -t'|' -nk 3 > output_file


EDIT: As per the OP's comment, to get the first 10% of the lines, save the output of the command above into output_file and then run the following.

awk -v lines="$(wc -l < output_file)" '
BEGIN{
  tenPer=int(lines/10)
}
FNR>(tenPer){exit}
1
' output_file

To get the last 10% of the lines of output_file, try:

tac output_file | 
awk -v lines="$(wc -l < output_file)" 'BEGIN{tenPer=int(lines/10)} FNR>tenPer{exit} 1' | 
tac

OR

awk -v lines="$(wc -l < output_file)" 'BEGIN{tenPer=int(lines/10)} FNR>(lines-tenPer)' output_file
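
The same 10% slices can also be taken with head and tail once the line count is known. The following is a minimal sketch, not a tested replacement for the commands above; tenth_count, first_tenth and last_tenth are hypothetical helper names, and the count is rounded down as with int(lines/10).

```shell
#!/usr/bin/env bash
# tenth_count FILE: prints floor(number_of_lines / 10)
tenth_count() {
  local total
  total=$(wc -l < "$1")
  echo $(( total / 10 ))
}

# first_tenth FILE: prints the first 10% of the lines (nothing if fewer than 10 lines)
first_tenth() {
  local n
  n=$(tenth_count "$1")
  if [ "$n" -gt 0 ]; then
    head -n "$n" "$1"
  fi
}

# last_tenth FILE: prints the last 10% of the lines (nothing if fewer than 10 lines)
last_tenth() {
  local n
  n=$(tenth_count "$1")
  if [ "$n" -gt 0 ]; then
    tail -n "$n" "$1"
  fi
}
```

For example, first_tenth output_file and last_tenth output_file would print the top and bottom slices of the sorted result.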

Upvotes: 3

thanasisp

Reputation: 5975

It is better to filter first and sort at the end, when there are fewer lines.

grep -E "#{4}$" file | sort -t"|" -nk3

To filter on a different number of #, change the number in the grep expression. For reversed sorting, add the r flag to the sort command. To sort on a different column, change the k argument.
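
These knobs could be wrapped in a small function, sketched below. filter_sort is a hypothetical name, and the default key of 2 is an assumption: with | as the only delimiter, the first energy value sits in field 2, because the file name and the first number share field 1.

```shell
#!/usr/bin/env bash
# filter_sort MIN FILE [KEY]
# Keeps the lines ending in at least MIN '#' characters, then sorts them
# numerically (most negative first) on the '|'-delimited field KEY (default 2).
filter_sort() {
  local min=$1 file=$2 key=${3:-2}
  grep -E "#{${min},}\$" "$file" |
    LC_ALL=C sort -t'|' -k"${key},${key}n"   # LC_ALL=C keeps '.' as the decimal point
}
```

For instance, filter_sort 4 file keeps the lines with four or more #, and filter_sort 4 file 4 sorts the same lines on the second energy column instead.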

Upvotes: 3

tripleee

Reputation: 189658

If your commands are really all you need, you can trivially pipe one into the other:

awk '(NR>1) && ($8 > 2) ' filename.csv |
sort -t '|' -k 3 > filename_processed.txt
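
As a variation, the filter can be applied to the penultimate column, which the question says holds the # count as a number. This is only a sketch under stated assumptions: keep_populated is a hypothetical name, the file has no header line (so there is no NR>1 test), and the energy to sort on is field 2 when | is the delimiter.

```shell
#!/usr/bin/env bash
# keep_populated FILE [MIN]
# Keeps the lines whose penultimate '|'-delimited field is greater than MIN
# (default 3), then sorts them numerically on field 2, most negative first.
keep_populated() {
  local file=$1 min=${2:-3}
  awk -F'|' -v min="$min" 'NF > 1 && $(NF-1) + 0 > min' "$file" |
    LC_ALL=C sort -t'|' -k2,2n
}
```

Note that in the sample data the row for 7000_01_lig_cne_16.dlg shows 4 in the numeric column but only ### in the last column, so a numeric filter and a #-counting filter can disagree on such a row.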

Upvotes: 1

stackoverflower

Reputation: 122

You can try:

sort -nr -k 4 scratch.scv | grep -v -E "[^#]#{1,3}$"

This sorts on the column value and drops the lines with 1 to 3 # characters.
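
Since sort -nr puts the least negative value first and sort -n puts the most negative value first, it may help to see both orders side by side. The sketch below filters before sorting and uses hypothetical function names; it assumes whitespace-separated fields with the first energy value in field 4, as in the command above.

```shell
#!/usr/bin/env bash
# drop_sparse FILE: removes the lines whose last column holds only 1 to 3 '#'
drop_sparse() {
  grep -v -E '[^#]#{1,3}$' "$1"
}

# most_negative_first FILE: ascending numeric sort on whitespace field 4
most_negative_first() {
  drop_sparse "$1" | LC_ALL=C sort -n -k4,4
}

# least_negative_first FILE: descending numeric sort on whitespace field 4
least_negative_first() {
  drop_sparse "$1" | LC_ALL=C sort -nr -k4,4
}
```

Which of the two matches "from mostly negative to positive" in the question is most_negative_first; the other is included only for comparison.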

Upvotes: 3
