Reputation: 1003
I'm another bash scripting newbie (having just discovered it, it blew my mind! It's so exciting). What I want to do is have a script that compiles a LOT of .csv files into just one bigfile.csv, removing the headers and inserting my own header. I discovered the following solution:
awk 'FNR > 1' *.csv > bigfile.csv
sed -i 1i"Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..." bigfile.csv
Great! But when I try to use this file for analysis I get errors because of bad lines. I had a look at it and, indeed, there are a few crazy entries in there.
Luckily, every row that I want from the original .csv files has the entry "MD" in the first column. So does anyone know how I can tell awk to only take the lines from the .csv files that have "MD" in their first cell?
EDIT: Thanks for your help guys, it worked like a charm! Unfortunately there's still some weird data in there:
CParserError: Error tokenizing data. C error: Expected 51 fields in line 6589, saw 54
Is there a simple adjustment that will, again, take only the lines with 51 fields?
Upvotes: 0
Views: 128
Reputation: 21965
awk 'BEGIN{print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
     FNR > 1 {print}' *.csv > bigfile.csv
FNR resets for each file that awk processes, but NR doesn't, so NR == FNR only for the first file.
A small illustration (of course with my test data):
$ cat f1
Name,Roll
A,10
B,5
$ cat f2
Name,Roll
C,56
D,44
$ awk 'BEGIN{print "Naam,RollNo"}FNR > 1{print}' f*>final
$ cat final
Naam,RollNo
A,10
B,5
C,56
D,44
Note
As you can see, the new header for the final file went into the awk BEGIN
section, which gets executed only once, at the beginning.
Coming to your objective
Every row that I want from the original .csv files has the entry "MD" in the first column
awk 'BEGIN{FS=",";print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
     FNR > 1 && $1 == "MD" && NF == 51 {print}' *.csv > bigfile.csv
Notes
This one has a few differences from the first general case:
FS="," sets , as the field separator.
FNR > 1 && $1 == "MD" && NF == 51 means: hey, I don't want the header (FNR == 1), and print stuff only when the first field is MD ($1 == "MD") and the number of fields is 51 (NF == 51).
The Idiomatic way
As @ghoti mentioned in his comment, awk's "default" action is already {print}, so the above script may be re-written as:
awk 'BEGIN{FS=",";print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
(FNR > 1 && NF == 51 && $1 == "MD")' *.csv > bigfile.csv
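If you want to convince yourself of the default print action and the NF count before touching the real data, a throw-away check on made-up input could look like:
$ echo "MD,a,b" | awk -F, 'NF == 3 && $1 == "MD"'
MD,a,b
$ echo "MD,a,b,c" | awk -F, 'NF == 3 && $1 == "MD"'
The first command prints its line through the default action because both conditions hold; the second prints nothing because that line has four fields.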
Upvotes: 3
Reputation: 46886
I'm going to go out on a limb here and assume that the line you're adding with sed is actually the headers that you're stripping off. If that's the case, I'd suggest you skip the sed line and just tell awk to strip the first line on files that are not the first one.
Next, if you only want lines containing the text MD in the first field, you can test that with a simple regex.
awk -F, '
FNR==1 && NR > 1 { next } # skip the header on all but the first file
NF != 51 { next } # skip this line if field count is wrong
$1 ~ /MD/ # print the line if the first field matches
' *.csv > /path/to/outputfile.csv
The -F, option tells awk to split fields using a comma as the field separator.
NR is the total number of records processed, while FNR is the current record number in the current file.
A pattern with no action uses print as the command (printing the current line).
You can of course put this entire awk script on one line if you like (as shown below); I split it out for easier reading.
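For reference, squeezed onto one line (same logic, comments dropped, same hypothetical output path) it would be:
awk -F, 'FNR==1 && NR>1 {next} NF!=51 {next} $1 ~ /MD/' *.csv > /path/to/outputfile.csv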
If your outputfile.csv is in the same directory where you are getting your "glob" of input csv files, then be aware that the new file will be created by the shell, not by awk, and might also be processed as an input file. This could be a concern if you were planning to append your redirect to an existing file with >>.
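To make that concrete (same filter as above; bigfile.csv here is just a hypothetical name sitting in the input directory):
# risky: bigfile.csv already exists in this directory and matches *.csv,
# so appending with >> feeds the previous output back in as input
awk -F, 'FNR==1 && NR>1 {next} NF!=51 {next} $1 ~ /MD/' *.csv >> bigfile.csv
# safer: keep the output file outside the glob, as in the command above
awk -F, 'FNR==1 && NR>1 {next} NF!=51 {next} $1 ~ /MD/' *.csv > /path/to/outputfile.csv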
UPDATE
As you've mentioned that the headers you're adding are different from the ones you strip off, you can still avoid using a separate command like sed by changing the awk script to something like this:
awk -F, '
BEGIN {
print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."
}
FNR==1 { next } # skip the header on all files
NF != 51 { next } # skip this line if field count is wrong
$1 ~ /MD/ # print the line if the first field matches
' *.csv > /path/to/outputfile.csv
Commands within awk's BEGIN block are executed before any input lines are processed, so if you print new headers there, they will appear at the beginning of your (redirected) output. (Note that there is a similar END block if you want to generate a footer/summary/etc. after all input has been processed.)
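For example, one way to use END here (a sketch of an optional extra, not something the question asks for, and assuming your awk supports writing to /dev/stderr) is a count of how many data lines made it through the filters:
awk -F, '
BEGIN {
    print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."
}
FNR==1    { next }             # skip the header on all files
NF != 51  { next }             # skip this line if field count is wrong
$1 ~ /MD/ { print; kept++ }    # keep the line and count it
END {
    # kept+0 guards against the case where nothing matched at all
    print kept+0, "data lines written" > "/dev/stderr"
}
' *.csv > /path/to/outputfile.csv
The count goes to stderr on purpose, so it never ends up inside the CSV output.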
Upvotes: 4
Reputation: 85875
A fancy one-liner would look like:
awk -F',' 'NR > 1 && $1 ~ /^MD/ && NF == 51 { print }' *.csv > /someotherpath/bigfile.csv
Instead of fancy one-liners, a proper way with a complete bash
script would be something like:
#!/bin/bash
# Assuming the '.csv' files are plain ',' separated
for i in *.csv; do
    [ -e "$i" ] || continue    # handle the case when no input *.csv files are present
    awk -F',' 'NR > 1 && $1 ~ /^MD/ && NF == 51 { print }' "$i"
done > /someotherpath/bigfile.csv    # redirect once, after the loop, so every file's output is kept
The crux of the solution is using awk's NR & NF variables: NR keeps track of the current record (row) number, and NF holds the number of fields in that row. So NR > 1 skips the header from being processed, $1 ~ /^MD/ returns only the lines whose first column starts with the pattern, and NF == 51 prints only those lines containing exactly 51 fields.
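If you want to see what the ^ anchor buys you, a quick test on made-up input (not your real data):
$ printf 'MD,1,2\nXMD,3,4\n' | awk -F',' '$1 ~ /^MD/'
MD,1,2
$ printf 'MD,1,2\nXMD,3,4\n' | awk -F',' '$1 ~ /MD/'
MD,1,2
XMD,3,4
The anchored pattern only accepts first fields that start with MD, while the unanchored one also lets something like XMD through.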
Upvotes: 2