Thanos
Thanos

Reputation: 586

Extracting data from a multicolumn file with missing entries

I am trying to extract some data from a multicolumn file named mass.mas03. Columns are seperated by tab. When the data starts you can see that there are entries that are empty(i.e. line 44 VS line 45).

The whole idea is to extract data from a specific column(MASS EXCESS, that is) for a number of rows, take the sum and create a horizontal line on a xy-axes system, with the y cordinate being the sum of those values.

As an example take lines 41,65,74

  -1    0    1    1 H          7288.97050    0.00011      0.0      0.0   B-      *                1 007825.03207    0.00010
   0    4    4    8 Be         4941.672      0.035     7062.435    0.004 B- -17979.819    1.001   8 005305.103      0.037
   4    7    3   10 Li   -n   33050.581     15.124     4531.555    1.512 B-  20443.910   15.123  10 035481.259     16.236

I want to do: 7288.97050 + 4941.672 - 33050.581 = - 20819.9385 . If I use an awk command to have access on the data, I face this problem

> awk '/8 Be/ {print $6}' mass.mas03
  4941.672 
>awk '/10 Li/ {print $6}' mass.mas03
  -n

I don't know if awk is the best option to do not only the data selection, as well as the "plot" so I am open to suggestions!

An obvious solution would be to use $7 instead of $6 where needed, but it's a large file and I'm trying to automate a bit the whole process.

Is there a way to "chose" the desired data and apply some simple calclations on them, for this weird column file?

Upvotes: 2

Views: 143

Answers (3)

Mr.Gaz
Mr.Gaz

Reputation: 11

to me the file is still structured for Fields starts at one point and ends at another a specific field always has the same number of charecters so it is a fixed width fields

$7 starts at 35 and its width is 12

use awk '/Li/ {print substr($0,35,12)}' mass.mas03

Upvotes: 1

Ed Morton
Ed Morton

Reputation: 204259

if your data is really tab-separated then you'd simply use:

awk -F'\t' '{ whatever }'

If it's fixed width fields instead then you should use:

gawk 'BEGIN{FIELDWIDTHS="list of field widths"} { whatever }'

e.g.:

$ cat file
  -1    0    1    1 H          7288.97050    0.00011
   0    4    4    8 Be         4941.672      0.035
   4    7    3   10 Li   -n   33050.581     15.124

$ gawk 'BEGIN{FIELDWIDTHS="4 5 5 5 5 5 12 12"} {print $7}' file
  7288.970
  4941.672
 33050.581

Note that FIELDWIDTHS is gawk-specific. I've no idea if those are the correct field widths or not for your data, I just did a ballpark estimate to demonstrate the approach. Obviously if you want to convert it from fixed-with fields to tab-separated then it's trivial to do that:

$ gawk -v OFS='\t' 'BEGIN{FIELDWIDTHS="4 5 5 5 5 5 12 12"} {$1=$1}1' file
  -1        0       1       1    H                7288.97050        0.00011
   0        4       4       8    Be               4941.672          0.035
   4        7       3      10    Li      -n      33050.581         15.124

and if you want to strip off the leading/trailing spaces from each field:

$ gawk -v OFS='\t' 'BEGIN{FIELDWIDTHS="4 5 5 5 5 5 12 12"} {$1=$1; gsub(/ /,"")}1' file
-1      0       1       1       H               7288.97050      0.00011
0       4       4       8       Be              4941.672        0.035
4       7       3       10      Li      -n      33050.581       15.124

Upvotes: 2

Håkon Hægland
Håkon Hægland

Reputation: 40778

To replace spaces with tabs, you can try the following:

awk -f f.awk mass.mas03 > mass.mas03.tab

where f.awk is:

/^1N-Z/ {
    f=1
    print
    getline
    print
    next
}
f {
    gsub(/ +/,"\t")
}
{print}

Upvotes: 2

Related Questions