Reputation: 575
I have a folder with several hundred files, each has the same structure, Here's a sample:
Start Date End Date Code1 Code2 Vendor Identifier Quantity V1_1 V1_2 Currency V1_3 ID V1_4 V2 V3 V4 TypeID OtherID Country_of_Sale V5 V6 V7 V8
11/27/16 12/31/16 character_value character_value 2 USD numeric_value character_value character_value character_value character_value AU
11/27/16 12/31/16 character_value character_value 1 USD numeric_value character_value character_value character_value character_value AU
11/27/16 12/31/16 character_value character_value 1 USD numeric_value character_value character_value character_value character_value AU
row count 3558
Country_of_Sale TotalA TotalB TotalC TotalD spu TotalE V2_1 V2_2 TotalF V2_3 V2_4
AR 0 2782223 2782223 7763.1 0.002790251 22 0.05 0.05 4626.17 5023
US 0 2497603034 2497603034 2958948.67 0.001184715 111374 109.33 109.33 1763291.86 1897441
DO 0 529132 529132 632.54 0.001195429 5 0.01 0.01 376.94 403
EC 0 794440 794440 1669.63 0.002101644 14 0.02 0.02 994.96 1087
BR 0 24397952 24397952 57932.77 0.002374493 217 0.43 0.43 34523.2 37225
Ctotal 109.84
Stotal 5680.38
Total 5790.22
As you can see, each file should really be two separate files; one with header row
Start Date End Date Code1 Code2 Vendor Identifier Quantity V1_1 V1_2 Currency V1_3 ID V1_4 V2 V3 V4 TypeID OtherID Country_of_Sale V5 V6 V7 V8
and one with header row
Country_of_Sale TotalA TotalB TotalC TotalD spu TotalE V2_1 V2_2 TotalF V2_3 V2_4
The line that separates these two always has $1 == row count (/^row count/?)
I want two outcome files, one for each header row described above. But there are, again, several hundred files -- all of them in a single directory --to pull those from:
I know my solution lies in awk. I don't know awk. I've researched for several hours, and I have figured out how to solve different parts of this, but just can't figure out how to pull it all together.
What I ultimately need is two tables that I can join (in SQL) on Country_of_Sale
.
Simple:
Start Date End Date UPC ISRC/ISBN Vendor Identifier Quantity V1_1 V1_2 Currency V1_3 ID V1_4 V2 V3 V4 TypeID OtherID Country_of_Sale V5 V6 V7 V8
11/27/16 12/31/16 character_value character_value 2 USD numeric_value character_value character_value character_value character_value AU
11/27/16 12/31/16 character_value character_value 1 USD numeric_value character_value character_value character_value character_value AU
11/27/16 12/31/16 character_value character_value 1 USD numeric_value character_value character_value character_value character_value AU
Country_of_Sale TotalA TotalB TotalC TotalD spu TotalE V2_1 V2_2 TotalF V2_3 V2_4
AR 0 2782223 2782223 7763.1 0.002790251 22 0.05 0.05 4626.17 5023
US 0 2497603034 2497603034 2958948.67 0.001184715 111374 109.33 109.33 1763291.86 1897441
DO 0 529132 529132 632.54 0.001195429 5 0.01 0.01 376.94 403
EC 0 794440 794440 1669.63 0.002101644 14 0.02 0.02 994.96 1087
BR 0 24397952 24397952 57932.77 0.002374493 217 0.43 0.43 34523.2 37225
I started with this:
gawk '
/^row count/ {nextfile}
NR == 1 {$0 = "Filename" OFS $0; print}
FNR > 1 {$0 = FILENAME OFS $0; print}
' OFS='\t' dir/to/raw/files/*.txt > dir/to/munged/file/file1.txt
and
gawk 'FNR==1,/^Country_Of_Sale/{next} /^CTotal/ {nextfile}
{ $0 = FILENAME OFS $0; print }' OFS='\t' dir/to/raw/files/*.txt > dir/to/munged/file/file2.tsv
Which kinda works, but I want to do it in one line.
So I messed around with various permutations of this:
awk -F, '{print > $1}' file1
But to be perfectly honest, I don't really get it. I'm more comfy with already-wrangled data.
I hope I've given enough here. I certainly don't want to exploit the resource.
Upvotes: 3
Views: 103
Reputation: 29167
Assuming your files have the .txt
extension and you want to name the generated files with the .txt.1
or .txt.2
extension, you could try something like:
awk 'BEGINFILE{f=FILENAME".1"} /^row count/{f=FILENAME".2";next} /^Ctotal/{nextfile} {print>f}' *.txt
Explanations:
At the beginning of the processing of each input file, variable f
is set to FILENAME.1
where FILENAME
(awk buit-in variable) is the name of the currently processes file.
When the current line of the current input file starts with row count
, variable f
is set to FILENAME.2
and the line is skipped.
When the current line of the current input file starts with Ctotal
the rest of the file is skipped.
Variable f
is used as the output file name for all non-skipped lines.
Upvotes: 1