Christopher Turnbull

Reputation: 1003

Bash scripting: compiling specific CSV rows

I'm another bash scripting newbie (I just discovered it, and it blew my mind; it's so exciting!). What I want to do is write a script that compiles a LOT of .csv files into a single bigfile.csv, removing their headers and inserting my own header. I found the following solution:

awk 'FNR > 1' *.csv > bigfile.csv
sed -i 1i"Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..." bigfile.csv

Great! But when I try to use this file for analysis I'm getting errors because of bad lines. I had a look at it, and indeed, there are a few crazy entries in there.

Luckily, every row that I want from the original .csv files has the entry "MD" in the first column. So does anyone know how I can tell awk to only take the lines from the .csv files that have "MD" in their first cell?

EDIT: Thanks for your help guys, it worked like a charm! Unfortunately there's still some weird data in there:

CParserError: Error tokenizing data. C error: Expected 51 fields in line 6589, saw 54

With a simple adjustment, is there a way to again only take lines with 51 fields?

Upvotes: 0

Views: 128

Answers (3)

sjsam

Reputation: 21965

awk 'BEGIN{print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
     FNR > 1 {print}' *.csv > bigfile.csv

FNR resets at the start of each file awk processes, but NR doesn't; NR == FNR only while the first file is being read.


A small illustration (with my own test data, of course):

$ cat f1
Name,Roll
A,10
B,5
$ cat f2
Name,Roll
C,56
D,44
$ awk 'BEGIN{print "Naam,RollNo"}FNR > 1{print}' f*>final
$ cat final 
Naam,RollNo
A,10
B,5
C,56
D,44

Note

As you can see, the new header for the final file goes into awk's BEGIN section, which is executed only once, before any input is read.


Coming to your objective

Every row that I want from the original .csv files has the entry "MD" in the first column

awk 'BEGIN{FS=",";print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
     FNR > 1 && $1 == "MD" && NF == 51 {print}' *.csv > bigfile.csv

Notes

This one has a few differences from the first, general case.

  • It sets , as the field separator.
  • FNR > 1 && $1 == "MD" && NF == 51 means: skip the header line (FNR > 1), and print only when the first field is MD ($1 == "MD") and the line has exactly 51 fields (NF == 51).
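A quick sketch of the filter on toy data (a hypothetical three-field layout stands in for the real 51-field one, so NF == 3 replaces NF == 51):

```shell
# Toy input: one good MD row, one non-MD row, one short MD row
printf 'Ident,Node,Val\nMD,n1,7\nXX,n2,8\nMD,n3\n' > sample.csv

# Header from BEGIN, then only complete MD rows survive the filter
awk 'BEGIN{FS=",";print "Ident,Node,Val"}
     FNR > 1 && $1 == "MD" && NF == 3 {print}' sample.csv
# → Ident,Node,Val
# → MD,n1,7
```

The XX row fails the $1 test, and the truncated MD,n3 row fails the NF test, so both are dropped.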

The Idiomatic way

As @ghoti mentioned in his comment:

awk's "default" command is already {print}

So the above script may be rewritten as:

awk 'BEGIN{FS=",";print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
         (FNR > 1 && NF == 51 && $1 == "MD")' *.csv > bigfile.csv

Upvotes: 3

ghoti

Reputation: 46886

I'm going to go out on a limb here and assume that the line you're adding with sed is actually the headers that you're stripping off.

If that's the case, I'd suggest you skip the sed line, and just tell awk to strip the first line on files that are not the first one.

Next, if you only want lines containing the text MD in the first field, you can test that with a simple regex.

awk -F, '
    FNR==1 && NR > 1 { next }  # skip the header on all but the first file
    NF != 51 { next }          # skip this line if field count is wrong
    $1 ~ /MD/                  # print the line if the first field matches
' *.csv > /path/to/outputfile.csv
  • The -F, option tells awk to split fields using a comma as field separator.
  • NR is the total number of records processed, while FNR is the current record number in the current file.
  • A condition with no commands assumes print as the command (printing the current line).

You can of course put this entire awk script on one line if you like. I split it out for easier reading.

If your outputfile.csv is in the same directory where you are getting your "glob" of input csv files, then be aware that the new file will be created by the shell, not by awk, and might also be processed as an input file. This could be a concern if you were planning to append your redirect to an existing file with >>.
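One simple way around that caveat (filenames here are hypothetical) is to give the output a name the glob can never match, so repeated runs won't pick it up as input:

```shell
# Two small inputs, each with a header row
printf 'header\nMD,1\n' > a.csv
printf 'header\nMD,2\n' > b.csv

# .out is not matched by *.csv, so bigfile.out can never feed back in
awk -F, 'FNR > 1 && $1 ~ /MD/' *.csv > bigfile.out
cat bigfile.out
# → MD,1
# → MD,2
```

Writing to another directory (e.g. ../bigfile.csv) works just as well.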

UPDATE

As you've mentioned that the headers you're adding are different from the ones you strip off, you can still avoid using a separate command like sed, by changing the awk script to something like this:

awk -F, '
    BEGIN {
      print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."
    }
    FNR==1 { next }            # skip the header on all files
    NF != 51 { next }          # skip this line if field count is wrong
    $1 ~ /MD/                  # print the line if the first field matches
' *.csv > /path/to/outputfile.csv

Commands within awk's BEGIN block are executed before any input lines are processed, so if you print new headers there, they will appear at the beginning of your (redirected) output. (Note that there is a similar END block if you want to generate a footer/summary/etc after all input has been processed.)
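To illustrate that END block, here is a small sketch (with made-up field names and toy data) that appends a row count as a footer after all input is processed:

```shell
# Toy input: header, two MD rows, one non-MD row
printf 'h1,h2\nMD,1\nXX,2\nMD,3\n' > part.csv

awk -F, '
    BEGIN { print "Ident,Value" }          # header, before any input
    FNR==1 { next }                        # skip the file header
    $1 ~ /MD/ { count++; print }           # keep and count MD rows
    END { print "# " count " data rows" }  # footer, after all input
' part.csv
# → Ident,Value
# → MD,1
# → MD,3
# → # 2 data rows
```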

Upvotes: 4

Inian

Reputation: 85875

A fancy one-liner would look like:

awk -F',' 'NR > 1 && $1 ~ /^MD/ && NF == 51 { print }' *.csv > /someotherpath/bigfile.csv

A proper way, with a complete bash script instead of fancy one-liners, would be something like:

#!/bin/bash

# Assuming the '.csv' files are plain single-',' separated

for i in *.csv; do
    [ -e "$i" ] || continue    # handle the case when no input *.csv files are present
    awk -F',' 'FNR > 1 && $1 ~ /^MD/ && NF == 51 { print }' "$i"
done > /someotherpath/bigfile.csv

The crux of the solution is awk's NR and NF variables, which track the current record (line) number and the number of fields in that record. NR > 1 skips the header from being processed, $1 ~ /^MD/ keeps only the lines whose first column starts with the pattern, and NF == 51 keeps only the lines containing exactly 51 fields.
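One subtlety worth seeing on toy data (hypothetical filenames): across several files NR never resets, so NR > 1 only skips the very first header line; headers of later files slip past that test and are instead caught by the /^MD/ check:

```shell
# Two files, each with its own header line
printf 'hdr\nMD,1\n' > p.csv
printf 'hdr\nMD,2\n' > q.csv

# q.csv's "hdr" line has NR=3 (> 1), but fails the /^MD/ match
awk -F, 'NR > 1 && $1 ~ /^MD/' p.csv q.csv
# → MD,1
# → MD,2
```

When the first-column pattern can't be relied on to exclude headers, FNR > 1 (which resets per file) is the safer test.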

Upvotes: 2
