Reputation: 65
I have a set of data consisting of seismic wave travel times and their corresponding information (i.e. the source that produced each wave and the time at which the wave arrived at each geophone along the spread). I am trying to format the data to fit my tomography code, but I'm still relatively new to awk. I now need to insert the number of receivers for each shot/source into that shot/source's information line, but the number varies each time. Is there a way to have awk count the number of rows and insert that count into the proper field?
My data is formatted as follows.
Each line that documents a source/shot:
s 0.01 0 0 -1 0
Every other line that follows the source/shot information:
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
I can use the "s" as a flag for the shot lines, and I would like to count the number of "r" lines for each source/shot and insert that number into the corresponding "s" line.
The number of "r" lines for each "s" line varies greatly.
Given this sample input:
s 0.01 0 0 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 0 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01
The expected output is:
s 0.01 0 3 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 5 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01
Note the 3 as $4 in the first "s" line and the 5 as $4 in the second one: the counted number of "r" rows goes in the 4th column of each "s" line.
My experience with awk is limited to rearranging/indexing columns, so I don't really know where to begin with this. I've tried googling for help with awk, but it's very difficult to find answered awk questions that actually pertain to my specific situation (hence this question).
I'm also new to using stackoverflow, so if I need to include more example data, please let me know. My data consists of approximately 4000 lines.
EDIT: The reason the desired result has slightly different data to the example of my data is because there are hundreds of lines for each "s" line and including that in the question seems excessive. I have cut out the majority of the data for ease of reading.
Upvotes: 3
Views: 201
Reputation: 1818
If you are using gawk (GNU awk), you can use the gensub function as follows, leveraging awk's ability to use arbitrary strings as field and record separators (demo):
awk -v RS='s' -v FS='r' -v ORS='' -v OFS='r' \
'{ $1 = gensub(/[0-9.-]+/, NF - 1, 3, $1) } NR != 1 { print RS $0 }' \
input.dat
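To see the record-splitting idea in isolation, here is a minimal sketch with made-up sample data: with RS set to s and FS set to r, every shot block becomes one record, and NF - 1 is the receiver count.

```shell
# Minimal sketch with made-up data: RS='s' makes each shot block one
# record, and FS='r' makes NF - 1 the count of r lines in that record.
printf '%s\n' 's a' 'r 1' 'r 2' 'r 3' |
awk -v RS='s' -v FS='r' 'NR != 1 { print "receivers:", NF - 1 }'
# prints: receivers: 3
```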
If the gensub function is not available, splitting the s line and making the change is a little more involved, but still doable (see below).
The approach works as follows:

- Set the field separator (FS) and record separator (RS) to r and s respectively. This means that awk will read each s line and its r lines as one record, and the number of occurrences of r is the number of fields (NF) minus 1.
- Use the gensub function to replace the n-th occurrence of a regex match on the first field (i.e., the s row) and return the result. In our case, we want to replace the third number with the count, i.e., NF - 1.
- { print } prints out every record, since the pattern matches all input. We skip the first record because awk sees an empty record in front of the first s, which we want to ignore.
- Set the output separators (ORS and OFS) to the same values as the input separators so that awk does not print its defaults (spaces and newlines). Usually we would set ORS to s, but that would produce a trailing s, so instead we print the s manually with the { print RS ... } bit and set ORS to empty.

A variation of this answer that would work even when the gensub function is not available is as follows (demo):
awk -v RS='s' -v FS='r' -v ORS='' -v OFS='r' \
'NR != 1 {
string = "";
n = split($1, numbers, " ");
numbers[3] = NF - 1;
for (i = 1; i <= n; i++) string = string " " numbers[i];  # numeric order; "for (i in numbers)" is unordered
$1 = string "\n";
print "s" $0
}' input.dat
The logic remains the same, except that since we cannot use a handy regex function to do the replacement, we split the string and replace the part we need.
Upvotes: 0
Reputation: 45
When solving problems for myself I often have to resort to dirty techniques, such as modifying the input. Here, I'm adding a line starting with "s" at the end of the file to avoid creating an END block. If my code were simpler, having an END block would be much preferred, of course. Or, apparently, grokking awk's function syntax would have helped me to simplify as well.
sed '$a\
s
' input |
awk '
BEGIN {
delete lines[0]              # make "lines" an array
}
{
line=$0                      # keep the raw line; the $4 assignment below rewrites $0
}
($1 == "s") {
ln=NR
if (length(lines) > 0) {
cnt=length(lines)-1          # entry 0 is the buffered s line; the rest are r lines
$0=lines[0]                  # recall the buffered s line
$4=cnt                       # insert the receiver count
print
for (i=1; i<=cnt; i++)       # numeric order; "for (i in lines)" is unordered
print lines[i]
delete lines
}
lines[0]=line                # buffer the current s line
}
($1 == "r") {
lines[NR-ln]=line            # buffer r lines at indices 1, 2, ...
}
'
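For comparison, the END-block version alluded to above can be sketched roughly as follows; the sample data is piped in inline, so no sed sentinel is needed, and the flush/shot/buf names are mine:

```shell
# Buffer each s line and its r lines; flush the previous block (with
# the count patched into $4) whenever a new s line or EOF is reached.
printf '%s\n' 's 0.01 0 0 -1 0' 'r 0.1 0 0 1.218 0.01' 'r 0.15 0 0 1.214 0.01' |
awk '
function flush(  i, orig) {
    if (shot == "") return       # nothing buffered yet
    orig = $0                    # the $4 assignment below rewrites $0
    $0 = shot; $4 = n; print     # emit the s line with its receiver count
    for (i = 1; i <= n; i++) print buf[i]
    $0 = orig
}
$1 == "s" { flush(); shot = $0; n = 0 }
$1 == "r" { buf[++n] = $0 }
END { flush() }
'
```

For this two-receiver sample it prints the s line as s 0.01 0 2 -1 0 followed by the two r lines unchanged.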
Upvotes: 1
Reputation: 103754
Here is a Ruby solution based on a multi-line regex:
ruby -e 'puts $<.read.scan(/(^s.*\R)((?:^r.*\R?)+)/).
map{|s,r| n=r.split(/\R/).length; a=s.split; a[3]=n; "#{a.join(" ")}\n#{r}"}' file
Or, reverse the lines in memory and print at the end:
ruby -lane 'BEGIN{lines=[]}
lines<<$F
END{
n=0
puts lines.reverse.
map{|l| if l[0]=="s" then l[3]=n; n=0 else n+=1 end; l.join(" ")}.
reverse.join("\n")
}
' file
Or, parse the input into rolling blocks. The advantage here is that only the relevant block has to be in memory:
ruby -lane 'BEGIN{
lines=[]
def print_block(block) = puts block.map{|l| l.join(" ")}.join"\n"
}
if $F[0]=="s" then
print_block(lines) if lines.length>0
$F[3]=$F[3].to_i
lines=[$F]
else
lines[0][3]+=1
lines<<$F
end
END{print_block(lines)}
' file
Or you can use this GNU awk:
gawk '@include "join"
function p(){
for(i=1;i<=length(lines); i++)
print join(lines[i],1,length(lines[i])," ")
}
/^s/{
    if (lc) p()
delete lines
lc=1
for (i=1;i<=NF;i++) lines[lc][i]=$i
}
/^r/{
lc++
for (i=1;i<=NF;i++) lines[lc][i]=$i
lines[1][4]++
}
END{p()}' file
Any of these prints:
s 0.01 0 3 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 5 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01
Upvotes: 1
Reputation: 203229
Using any awk:
$ awk '
/^s/ {
if (NR>1) {
prt()
}
cnt = 0
shot = $0
rs = ""
}
/^r/ {
cnt++
rs = rs $0 ORS
}
END { prt() }
function prt( orig) {
orig = $0
$0 = shot
$4 = cnt+0
print $0
printf "%s", rs
$0 = orig
}
' file
s 0.01 0 3 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 5 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01
Upvotes: 4
Reputation: 34124
Undoing the desired updates from OP's expected output gives me the following input:
$ cat input.dat
s 0.01 0 0 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 0 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01
One awk
idea:
awk '
function print_block(i) {
if (s_line) { # if s_line not empty then ...
sub(/COUNT/,cnt,s_line) # replace "COUNT" with actual count and ...
print s_line # print s line and ..
for (i=1; i<=cnt; i++)
print r_lines[i] # r lines to stdout
}
cnt = 0
s_line = ""
delete r_lines
}
$1 == "s" { print_block() # print previous block of s/r lines
$4 = "COUNT" # replace 4th field with placeholder "COUNT"
s_line = $0 # save current s line
}
$1 == "r" { r_lines[++cnt] = $0 } # save r lines
END { print_block() } # flush last s/r block to stdout
' input.dat
This generates:
s 0.01 0 3 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 5 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01
Upvotes: 1
Reputation: 16662
A simple method is to read the file backwards:

- on an r line, increment a counter
- on an s line, substitute the counter and reset it

and then reverse the result:
tac input |
awk '
/^r/ { n++ }
/^s/ { $4=n; n=0 }
{ print }
' |
tac > output
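A self-contained sketch of the same pipeline with inline sample data (tac is GNU coreutils; on BSD systems tail -r plays the same role):

```shell
# Reversed, every r line is seen before its s line, so the running
# counter is complete by the time the s line is patched.
printf '%s\n' 's 0.01 0 0 -1 0' 'r 0.1 0 0 1.218 0.01' 'r 0.15 0 0 1.214 0.01' |
tac |
awk '/^r/ { n++ }  /^s/ { $4 = n; n = 0 }  { print }' |
tac
# prints the s line as: s 0.01 0 2 -1 0
```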
You can read the file forwards but that involves maintaining state:
awk '
/^s/ {
# this prints the *previous* group of lines
if (NR>1)
    print c1, c2, c3, n, c5, c6 r   # no comma before r: it already starts with RS
# save s columns, initialise n counter and r string
c1=$1; c2=$2; c3=$3; n=0; c5=$5; c6=$6; r=""
}
/^r/ {
n++
r = r RS $0
}
END {
# print final group
  print c1, c2, c3, n, c5, c6 r   # no comma before r: it already starts with RS
}
' input >output
Upvotes: 6