VRehnberg

Reputation: 582

(sed/awk) extract values text file and write to csv (no pattern)

I have (several) large text files from which I want to extract some values to create a csv file with all of these values.

My current solution is to have a few different calls to sed from which I save the values and then have a python script in which I combine the data in different files to a single csv file. However, this is quite slow and I want to speed it up.

The file, let's call it my_file_1.txt, has a structure that looks something like this:

lines I don't need
start value 123
lines I don't need
epoch 1
...
lines I don't need
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
lines I don't need
epoch 1
...
lines I don't need
epoch 72
stop value 456
...

and I would like to construct something like

file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2
...

How can I get the results I want? It doesn't have to be Sed or Awk as long as I don't need to install something new and it is reasonably fast.

I don't really have any experience with awk. With sed my best guess would be

filename=$1
echo 'file,start,stop,epoch,run' > my_data.csv
sed -n '
  s/.*start value \([0-9]\+\).*/'"$filename"',\1,/
  h
  $!N
  /.*epoch \([0-9]\+\).*\n.*stop value\([0-9]\+\)/{s/\2,\1/}
  D
  T
  G
  P
' $filename | sed -z 's/,\n/,/' >> my_data.csv

and then deal with not getting the run number. Furthermore, this is not quite correct, as the N will gobble up some "start value" lines, leading to wrong results. It feels like this could be done more easily with awk.

It is similar to 8992158 but I can't use that pattern and I know too little awk to rewrite it.

Solution (Edit)

I was not general enough in my description of the problem, so I changed it up a bit and fixed some inconsistencies.

Awk (Rusty Lemur's answer)

Here I generalised from knowing that the numbers were at the end of the line to using gensub. For this I should have specified the version of awk, as gensub is a GNU awk extension and is not available in all versions.

BEGIN {
  counter = 1 
  OFS = ","   # This is the output field separator used by the print statement
  print "file", "start", "stop", "epoch", "run"  # Print the header line
}

/start value/ {
  startValue = gensub(/.*start value ([0-9]+).*/, "\\1", 1, $0) 
}

/epoch/ {
  epoch = gensub(/.*epoch ([0-9]+).*/, "\\1", 1, $0) 
}

/stop value/ {
  stopValue = gensub(/.*stop value ([0-9]+).*/, "\\1", 1, $0) 
  
  # we have everything to print our line
  print FILENAME, startValue, stopValue, epoch, counter
  counter = counter + 1 
  startValue = "" # clear variables so they aren't maintained through the next iteration
  epoch = ""
}
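
For awk versions without gensub, roughly the same extraction can be sketched with the POSIX functions index, match and substr. This is only a sketch under assumptions: the helper name num_after and the sample file sample_run.txt are made up for illustration, and unlike the greedy gensub pattern it uses the first occurrence of the label on the line.

```shell
# Sample input in the shape described in the question (contents are made up).
cat > sample_run.txt <<'EOF'
lines I don't need
words start value 345 more words
some epoch 18 words
stop value 456
EOF

out=$(awk '
# Hypothetical helper: return the digits that directly follow "label"
# in "line", or "" if there are none. pos and rest are local variables.
function num_after(label, line,    pos, rest) {
  pos = index(line, label)
  if (pos == 0) return ""
  rest = substr(line, pos + length(label))
  if (match(rest, /^[0-9]+/)) return substr(rest, RSTART, RLENGTH)
  return ""
}
BEGIN { OFS = ","; print "file", "start", "stop", "epoch", "run" }
/start value/ { startValue = num_after("start value ", $0) }
/epoch/       { epoch = num_after("epoch ", $0) }
/stop value/  {
  print FILENAME, startValue, num_after("stop value ", $0), epoch, ++run
  startValue = ""; epoch = ""   # clear state before the next run
}
' sample_run.txt)
echo "$out"
```

On this sample the last line printed should be sample_run.txt,345,456,18,1.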

I accepted this answer because it is the most understandable.

Sed (potong's answer)

sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
        /^.*start value/{:a;N;/\n.*stop value/!ba;x
        s/.*/expr & + 1/e;x;G;F
        s/^.*start value (\S+).*\n.*epoch (\S+)\n.*stop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' my_file_1.txt |
        sed '1!N;s/\n//'

Upvotes: 1

Views: 1258

Answers (3)

potong

Reputation: 58483

This might work for you (GNU sed):

sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
        /^start value/{:a;N;/\nstop value/!ba;x
        s/.*/expr & + 1/e;x;G;F
        s/^start value (\S+).*\nepoch (\S+)\nstop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' file |
        sed '1!N;s/\n//'

The solution contains two invocations of sed: the first formats everything but the file name, and the second embeds the file name into the csv file.

Format the header line on the first line and prime the run number.

Gather up lines between start value and stop value.

Increment the run number, append it to the current line and output the file name. This prints two lines per record: the first is the file name and the second is the remainder of the csv line.

In the second sed invocation read two lines at a time (except for the first line) and remove the newline between them, formatting the csv file.
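
The joining step in the second invocation can be tried on its own. The input below is a hand-written stand-in for what the first sed would emit (a header, then a file-name line followed by a line with the remaining fields for each record), so the values are assumptions taken from the question's example:

```shell
# Header line, then pairs of lines: file name, then the remaining csv fields.
out=$(printf '%s\n' \
  'file,start,stop,epoch,run' \
  'my_file_1.txt' ',123,234,18,1' \
  'my_file_1.txt' ',345,456,72,2' |
  sed '1!N;s/\n//')   # on every line but the first, append the next line and drop the newline
echo "$out"
```

The header passes through unchanged because N is skipped on line 1, and each later pair collapses into one csv line.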

Upvotes: 1

Rusty Lemur

Reputation: 1885

awk's basic structure is:

  1. read a record from the input (by default a record is a line)
  2. evaluate conditions
  3. apply actions

The record is split into fields (by default based on whitespace as the separator). The fields are referenced by their position, starting at 1. $1 is the first field, $2 is the second. The number of fields in the current record is stored in a variable named NF, so $NF is the last field, $(NF-1) is the second-to-last field, etc.
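
As a quick illustration of field references, here is a one-liner run on a single made-up line (the line content is an assumption, not taken from a real file):

```shell
# Print the first, second-to-last and last field, followed by the field count.
out=$(printf 'words start value 345\n' | awk '{ print $1, $(NF-1), $NF, NF }')
echo "$out"   # words value 345 4
```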

A "BEGIN" section will be executed before any input file is read, and it can be used to initialize variables (uninitialized variables default to the empty string, which is treated as 0 in numeric context).

BEGIN {
  counter = 1
  OFS = ","   # This is the output field separator used by the print statement
  print "file", "start", "stop", "epoch", "run"  # Print the header line
}

/start value/ {
  startValue = $NF  # when a line contains "start value" store the last field as startValue 
}

/epoch/ {
  epoch = $NF
}

/stop value/ {
  stopValue = $NF

  # we have everything to print our line
  print FILENAME, startValue, stopValue, epoch, counter
  counter = counter + 1
  startValue = "" # clear variables so they aren't maintained through the next iteration
  epoch = ""
}

Save that as processor.awk and invoke as:

awk -f processor.awk my_file_1.txt my_file_2.txt my_file_3.txt > output.csv
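
Note that with several input files the counter keeps counting across files. If the run number should restart for each file (an assumption, the question does not say), one possible sketch is to reset it when FNR, the per-file record number, is 1. The file names run_a.txt and run_b.txt and their contents are made up for the demonstration:

```shell
# Two tiny made-up inputs, each with one start/epoch/stop group.
printf 'start value 1\nepoch 5\nstop value 2\n' > run_a.txt
printf 'start value 3\nepoch 7\nstop value 4\n' > run_b.txt

out=$(awk '
BEGIN { OFS = "," }
FNR == 1 { counter = 1 }          # first line of each file: restart the run number
/start value/ { startValue = $NF }
/epoch/       { epoch = $NF }
/stop value/  { print FILENAME, startValue, $NF, epoch, counter++ }
' run_a.txt run_b.txt)
echo "$out"
```

Here both output lines end in run number 1; without the FNR == 1 rule the second file's line would end in 2.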

Upvotes: 2

Ed Morton

Reputation: 204015

It's not clear how you'd get exactly the output you provided from the input you provided but this may be what you're trying to do (using any awk in any shell on every Unix box):

$ cat tst.awk
BEGIN {
    OFS = ","
    print "file", "start", "stop", "epoch", "run"
}
{ f[$1] = $NF }
$1 == "stop" {
    print FILENAME, f["start"], f["stop"], f["epoch"], ++run
    delete f
}

$ awk -f tst.awk my_file_1.txt
file,start,stop,epoch,run
my_file_1.txt,123,234,N,1
my_file_1.txt,345,456,M,2

Upvotes: 3
