jaksco

Reputation: 531

bash - add columns to CSV and rewrite headers with the filename as a prefix

I'd prefer a solution that uses bash rather than converting to a dataframe in Python, etc., as the files are quite big.

I have a folder of CSVs that I'd like to merge into one CSV. The CSVs all have the same header save a few exceptions, so I need to rewrite the name of each added column with the filename as a prefix to keep track of which file the column came from.

head file1.csv file2.csv

==> file1.csv <==
id,max,mean,90
2870316.0,111.77777777777777
2870317.0,63.888888888888886
2870318.0,73.6
2870319.0,83.88888888888889


==> file2.csv <==
ogc_fid,id,_sum
"1","2870316",9.98795110916615
"2","2870317",12.3311055738527
"3","2870318",9.81535963468479
"4","2870319",7.77729743926775

The id column of each file might be in a different "datatype" but in every file the id matches the line number. For example, line 2 is always id 2870316.

Anticipated output:

file1_id,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775

I'm not quite sure how to do this, but I think I'd use the paste command at some point. I'm surprised that I couldn't find a similar question on Stack Overflow, but I guess it's not that common to have CSVs with the same id on the same line number.

Edit:

I figured out the first part.

paste -d , * > ../rasterjointest.txt achieves what I want, but the header needs to be replaced.
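
A minimal sketch of that remaining header step, assuming all the CSVs sit in the current directory (hence the *.csv glob) and reusing the ../rasterjointest.txt output path from above; the variable names are purely illustrative:

out=../rasterjointest.txt
header=
for f in *.csv; do
    prefix=${f%.*}    # file name without its extension, e.g. "file1"
    # rewrite this file's header line so every column name gets the prefix
    h=$(head -n 1 "$f" | awk -F, -v OFS=, -v p="$prefix" '{ for (i=1; i<=NF; i++) $i = p "_" $i; print }')
    header=${header:+$header,}$h
done
printf '%s\n' "$header" > "$out"
paste -d , *.csv | tail -n +2 >> "$out"    # data rows only, original headers dropped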

Upvotes: 0

Views: 253

Answers (1)

Ed Morton

Reputation: 203229

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    # first line of each file: prefix every column name with the
    # file name minus its extension, e.g. "id" becomes "file1_id"
    fname = FILENAME
    sub(/\.[^.]+$/,"",fname)
    for (i=1; i<=NF; i++) {
        $i = fname "_" $i
    }
}
# append this file's current line to the same-numbered output row
{ row[FNR] = (NR==FNR ? "" : row[FNR] OFS) $0 }
END {
    for (rowNr=1; rowNr<=FNR; rowNr++) {
        print row[rowNr]
    }
}

$ awk -f tst.awk file1.csv file2.csv
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775

To use minimal memory in awk:

$ cat tst.awk
BEGIN {
    FS=OFS=","
    # read just the first (header) line of every file named on the command line
    for (fileNr=1; fileNr<ARGC; fileNr++) {
        filename = ARGV[fileNr]
        if ( (getline < filename) > 0 ) {
            fname = filename
            sub(/\.[^.]+$/,"",fname)
            # prefix each column name with the file name minus its extension
            for (i=1; i<=NF; i++) {
                $i = fname "_" $i
            }
        }
        row = (fileNr==1 ? "" : row OFS) $0
    }
    # print only the combined header; paste supplies the data rows
    print row
    exit
}

$ awk -f tst.awk file1.csv file2.csv; paste -d, file1.csv file2.csv | tail -n +2
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
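
For the whole folder from the question rather than two named files, the same two steps can presumably be pointed at a glob (assuming the shell expands *.csv in the same order for both commands, and that the output file sits outside the glob):

$ awk -f tst.awk *.csv > ../rasterjointest.txt
$ paste -d, *.csv | tail -n +2 >> ../rasterjointest.txt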

Upvotes: 1
