Blaisem

Reputation: 637

Matching multiple files with numbers and excluding one of the files by number

I have a range of numbered files (file1.txt, file2.txt, file3.txt, etc.) that I loop over in my script as input to an awk program. I can pattern match these as

awk ... file[1-$i].txt >> output

I would however like to exclude a file within that range, such as

file$v.txt

Goal

I am looking for something like

awk ... file[1-!$v-$i].txt >> output

where I match every file from 1 to $i, skipping the file numbered $v.


I have tried various inputs from composite pattern matching as described here, but I could not get the syntax to work for me.

Does anyone know how to do composite pattern matching like this? Thank you.


Sample inputs

On request, here are my files:

file.1.dat

29.078306 0.00676358
29.223592 0.00309192
30.297306 0.0174575
30.478883 0.132458
30.503705 0.118951
30.512891 0.0705088
31.945900 0.00408244
32.321011 0.00258023
32.894037 0.00407912
32.916263 0.00330154
34.594139 0.00874524
34.849178 0.0195172
34.884655 0.00547378
34.967403 0.00308369
35.325397 0.00818193

file.2.dat

25.970535 0.0979715
26.913976 0.00593039
29.078306 0.0984052
29.223592 0.00271504
30.236632 0.013818
30.478883 0.0347606
30.503705 0.102369
30.512891 0.0409633
31.714064 0.0242958
31.902306 0.0510168
32.715764 0.0146584
34.952965 0.00484555
35.190790 0.0114201
35.360372 0.0033089
35.575199 0.00282864
38.184618 0.00551692

file.3.dat

31.591771 0.0126916
32.059389 0.0605918
32.299959 0.122618
32.890418 0.0058495
32.962536 0.00492958
33.646214 0.0705359
33.679538 0.120592

file.4.dat

25.636267 0.00398174
27.848542 0.00485739
28.269278 0.0174401
29.418886 0.00409613
31.313212 0.203932
31.945900 0.00259743
32.256620 0.00325607
32.299959 0.0325366
33.461363 0.0798633
33.646214 0.0516498
33.679538 0.12871

file.5.dat

29.767600 0.00777448
32.299959 0.00777995
34.849178 0.0305844
34.884655 0.0126815
34.930799 0.0546924
34.952965 0.0711241

Awk Code

awk '
NR==FNR {
    a[$1]=$2
    next
}
($1 in a) {
    a[$1]+=$2
}
END {
    for(i in a)
        print i,a[i]
}' file.4.dat file.[1-5].dat >| test.out

This code does the following:

  1. Matches file.4.dat against file.1.dat, file.2.dat ... file.5.dat, based on the value in field 1 ($1).
  2. Wherever a match in $1 is found, it adds that row's $2 to the $2 of the matching row from file.4.dat.
  3. test.out contains the rows of file.4.dat, with $2 equal to the sum of $2 over all rows with a matching $1.

A simple example of what I am trying to do was asked in this question, which is where I got the awk code.

Goal

My goal is to have the following line in my output:

33.679538 0.249302

among other correctly matched lines, but this line is my current test to see if it works. Right now, I have:

33.679538 0.378012

as a result of file.4.dat being added to itself in the awk code, since I cannot exclude it in my 2nd argument for input file.
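
The arithmetic can be checked on that single row: 0.12871 (file.4.dat) + 0.120592 (file.3.dat) = 0.249302, and adding 0.12871 a second time gives 0.378012. A minimal sketch that reproduces this, assuming a throwaway temporary directory and a hypothetical function name demo_double_count:

```shell
#!/bin/sh
# Reproduce the double counting on the one row shared by file.3.dat and file.4.dat.
# demo_double_count is a hypothetical name used only for this illustration.
demo_double_count() {
    dir=$(mktemp -d) || return 1
    printf '33.679538 0.120592\n' > "$dir/file.3.dat"
    printf '33.679538 0.12871\n'  > "$dir/file.4.dat"
    # Reference file first, then a list that contains the reference again,
    # which is what file.[1-5].dat expands to in the real run.
    awk 'NR==FNR   { a[$1]=$2; next }
         ($1 in a) { a[$1]+=$2 }
         END       { for (i in a) print i, a[i] }' \
        "$dir/file.4.dat" "$dir/file.3.dat" "$dir/file.4.dat"
    rm -rf "$dir"
}

demo_double_count   # prints: 33.679538 0.378012
```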

Summary of Problems

My awk code is reading all of my input files, and I need to exclude 1 of the files in order to obtain the right output.

Ultimately, I have to input each of my 5 files individually against the other 4 files in the awk code above. In the future, the number of files will be variable, so I cannot just type the file names in my script. For now, if I can solve this at least for fewer than 10 files, it would be a major help.

Upvotes: 1

Views: 292

Answers (4)

Inian

Reputation: 85790

You can do this in awk itself: treat the first file argument as the reference and skip it when it appears again in the list, using the nextfile statement (available in GNU awk; other awk implementations may lack it), which abandons the current input file and moves on to the next one. For this to work, the reference file, e.g. file.4.dat, must be the first argument in the file list.

awk '
BEGIN{ ignoreFile = ARGV[1] }
NR==FNR {
    a[$1]=$2
    next
}
FILENAME == ignoreFile { nextfile }
($1 in a) {
    a[$1]+=$2
}
END {
    for(i in a)
        print i,a[i]
}' file.4.dat file.[1-5].dat >| test.out

OP wanted to know whether a list of filenames can be generated in the shell and passed to awk. It can be done, but compared with the simpler nextfile option above it may look complex.

From my understanding, you have n files and one of them is used as the reference file. I would prefer the extglob feature of the bash shell to include all the files except the reference. For example, I'm creating files file1..10 to explain this

touch file{1..10}
exclude=3

Extended globbing is enabled with the shopt built-in

shopt -s extglob
list=(!(file"$exclude"))

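Now print the array using declare -p list to see the list of files with just the reference file excluded. Note that !(...) matches every other name in the current directory, so this assumes the directory holds only the input files. Then use the array in your awk command as below. The array expansion "${list[@]}" expands to all the files except the one you excluded above.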

awk ... file"$exclude" "${list[@]}"
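
If extglob is not available (it is a bash feature), the same list can be built with a plain loop. The following is only a sketch: list_others is a hypothetical helper name, and the unquoted command substitution in the usage comment assumes the file names contain no whitespace or glob characters.

```shell
#!/bin/sh
# list_others REF FILE...  (hypothetical helper name)
# Print every FILE except REF, one per line.
list_others() {
    ref=$1
    shift
    for f in "$@"; do
        [ "$f" = "$ref" ] || printf '%s\n' "$f"
    done
}

# Usage with the question's naming scheme (reference first, then the rest):
exclude=4
# awk ... "file.$exclude.dat" $(list_others "file.$exclude.dat" file.[1-5].dat)
```

Unlike the !(...) pattern, this filters only the names it is given, so unrelated files in the directory are never picked up.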

Upvotes: 2

stack0114106

Reputation: 8711

Using two awk commands in a pipeline. The reference file (here file.4.dat) must be given last.

awk ' $(NF+1)=FILENAME' file.[1-3].dat file.5.dat file.4.dat |  
   awk ' { a[$1]+=$2; $2=a[$1] } /file.4.dat/ && NF-- '

with the given files

$ awk ' $(NF+1)=FILENAME' file.[1-3].dat file.5.dat file.4.dat |  
      awk ' { a[$1]+=$2; $2=a[$1] } /file.4.dat/ && NF-- '
25.636267 0.00398174
27.848542 0.00485739
28.269278 0.0174401
29.418886 0.00409613
31.313212 0.203932
31.945900 0.00667987
32.256620 0.00325607
32.299959 0.162935
33.461363 0.0798633
33.646214 0.122186
33.679538 0.249302

$

Upvotes: 0

RavinderSingh13

Reputation: 133640

If you do not want to use nextfile, or your awk does not have it, the following could help.

awk -v ignore="file.4.dat" '
FNR==1{
    no_parse=""
}
FNR==NR {
    a[$1]=$2
    next
}
FILENAME == ignore{
    no_parse=1
}
no_parse{
    next
}
($1 in a) {
    a[$1]+=$2
}
END {
    for(i in a)
        print i,a[i]
}' file.4.dat file.[1-5].dat >| test.out

The variable ignore holds the name of the input file to skip. When that file comes up for parsing, the flag no_parse is set to true, so none of its lines are processed (next skips all further rules for each line). The flag is reset on the first line of every file.

Upvotes: 0

Ed Morton

Reputation: 204055

To skip a file, set its entry in ARGV (indexed by its position in the argument list) to the empty string, e.g.:

$ ls
file1  file2  file3

$ grep . file*
file1:x
file2:y
file3:z

$ awk 'BEGIN{ARGV[2]=""} {print FILENAME, $0}' file*
file1 x
file3 z

or you can remove the "bad" file by name rather than order in the arg list if you prefer:

$ awk 'BEGIN{for (i in ARGV) if (ARGV[i]=="file2") ARGV[i]=""} {print FILENAME, $0}' file*
file1 x
file3 z

$ awk 'BEGIN{bad["file2"]; for (i in ARGV) if (ARGV[i] in bad) ARGV[i]=""} {print FILENAME, $0}' file*
file1 x
file3 z

$ awk '
    BEGIN {
        split("file2 file3",tmp); for (i in tmp) bad[tmp[i]]
        for (i in ARGV) if (ARGV[i] in bad) ARGV[i]=""
    }
    {print FILENAME, $0}
' file*
file1 x

Upvotes: 1
