Weiss Willy
Weiss Willy

Reputation: 109

AWK-Argument list too long (For good reason)

I must process a CSV file, and I need to repace some values from a column with a specific index number. Example

10/06-14:04:21.082467 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:22.082715 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:23.082940 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:24.083256 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:27.421793 ,1418,64878,192.168.0.13,161,0xC498BF38,0x0,,
10/06-14:04:27.522099 ,1418,,64879,192.168.0.13,161,0xC499BF39,0x0,,
10/06-14:04:33.445012 ,1421,64878,192.168.0.13,705,0xC498BF38,0x0,,
10/06-14:04:33.545144 ,1421,192.168.0.130xC498BF38,0x0,,

The column of interest is Column number 2. I extract it into a separate tmp file and I looks like this:

1917
1917
1917
1917
1418
1418
1421
1421

My expected output is:

5
5
5
5
2
2
0
0

Until now I use the script below:

csvtool col 2  $file>tmp && echo " TEMPORARY Column DONE "&& echo "Start the Magic"

## Perform an AWK comparation between the integers intervals and the terget number.
##In the end write everything out in the new column .
awk '{ 
if( ($1>=363 && $1<=499) || ($1>=4645 && $1<=4646)){ print 0}  
else if( ($1>=2174 && $1<=2193)) { print 1}  
else if( ($1==500) || ($1>=12308 && $1<=12356)){ print 2} 
else if( ($1>=103 && $1<=220) || ($1>=252 && $1<=299) || ($1>=1980 && $1<=1986) || ($1>=2921 && $1<=2922)){ print 3} 
else if( ($1>=221 && $1<=251) || ($1>=8085 && $1<=8091) || ($1==8350) || ($1>=12809 && $1<=12945) || ($1>=16834 && $1<=17033)){ print 4} 
else if( ($1>=300 && $1<=362) || ($1==522) || ($1>=2923 && $1<=2925) || ($1>=3441 && $1<=3442) || ($1==4644)|| ($1>=5677 && $1<=5695) || ($1>=8082 && $1<=8083)|| ($1>=8093 && $1<=8349) || ($1>=12946 && $1<=12947) || ($1>=12986 && $1<=13215) || ($1>=13309 && $1<=13311)){ print 5}
else if( ($1>=501 && $1<=504) || ($1>=566 && $1<=600) || ($1>=613 && $1<=637) ||  ($1>=2015 && $1<=2040) ||  ($1>=2103 && $1<=2126) || ($1>=2373 && $1<=2374) || ($1>=3828 && $1<=4125) || ($1>=4237 && $1<=4636) || ($1>=4647 && $1<=4889) || ($1>=4991 && $1<=5676) || ($1>=5696 && $1<=5705) || ($1>=6502 && $1<=6595) || ($1>=8429 && $1<=8460) || ($1>=8552 && $1<=8699) || ($1>=10487 && $1<=10977) || ($1>=11326 && $1<=11617) || ($1>=11688 && $1<=11815) || ($1>=11844 && $1<=11938) || ($1>=12490 && $1<=12597) || ($1>=12973 && $1<=12982) || ($1>=13367 && $1<=13414)){ print 6}
else if( ($1>=523 && $1<=548) || ($1>=555 && $1<=565) || ($1>=2005 && $1<=2014) || ($1>=2041 && $1<=2063) || ($1>=2091 && $1<=2102) ||  ($1==2394) || ($1>=2407 && $1<=2411) || ($1>=2926 && $1<=3008) || ($1>=3443 && $1<=3473) || ($1>=3486 && $1<=3813) || ($1>=4132 && $1<=4144) || ($1>=4637 && $1<=4643) || ($1>=4916 && $1<=4981) || ($1>=5711 && $1<=5741) || ($1>=6403 && $1<=6405) || ($1>=6415 && $1<=6466) || ($1>=6701 && $1<=7002) || ($1>=7035 && $1<=7048) || ($1>=8426 && $1<=8428) || ($1>=8496 && $1<=8541) || ($1>=8857 && $1<=9323) || ($1>=9429 && $1<=9618) || ($1>=9674 && $1<=9789) || ($1>=9802 && $1<=9811) || ($1>=9850 && $1<=10009) || ($1>=10131 && $1<=10136) || ($1>=10396 && $1<=10402) || ($1>=11000 && $1<=11175) || ($1==11618) || ($1>=12100 && $1<=12111) || ($1>=12212 && $1<=12219) || ($1==12489) || ($1>=12807 && $1<=12808) || ($1==12983) || ($1>=14616 && $1<=14627) || ($1>=15723 && $1<=15897)){ print 7}
else if( ($1==521) || ($1==554) || ($1>=601 && $1<=612) || ($1>=651 && $1<=708) || ($1>=1905 && $1<=1942) || ($1>=1949 && $1<=1979) || ($1>=1987 && $1<=1993) || ($1>=2259 && $1<=2278) || ($1>=2352 && $1<=2362) || ($1>=2395 && $1<=2406) || ($1>=2412 && $1<=2449) || ($1>=2673 && $1<=2919) || ($1>=3009 && $1<=3016) || ($1>=3814 && $1<=3827) || ($1>=4126 && $1<=4131) || ($1>=4982 && $1<=4990) || ($1>=5706 && $1<=5710) || ($1>=6012 && $1<=6181) || ($1>=6285 && $1<=6339) || ($1>=6409 && $1<=6411) || ($1>=6596 && $1<=6700) || ($1>=7191 && $1<=7424) || ($1==8081) || ($1>=8550 && $1<=8551) || ($1>=8700 && $1<=8716) || ($1>=9324 && $1<=9326) || ($1>=9619 && $1<=9624) || ($1==9729) || ($1>=10018 && $1<=10064) || ($1>=10115 && $1<=10126) || ($1>=10198 && $1<=10386) || ($1==10486) || ($1>=12112 && $1<=12115) || ($1>=12209 && $1<=12211)){ print 8}
else if( ($1>=489 && $1<=498) || ($1>=505 && $1<=520) || ($1>=549 && $1<=553) || ($1>=638 && $1<=650) || ($1>=709 && $1<=1904) || ($1>=1943 && $1<=1948) || ($1>=1994 && $1<=2004) || ($1>=2064 && $1<=2090) || ($1>=2127 && $1<=2173) || ($1>=2194 && $1<=2258) || ($1>=2279 && $1<=2351) || ($1>=2363 && $1<=2372) || ($1==2393) || ($1>=2450 && $1<=2672) || ($1>=3474 && $1<=3485) || ($1>=4145 && $1<=4236) || ($1>=4890 && $1<=4915) || ($1>=5742 && $1<=6011) || ($1>=7003 && $1<=7034) || ($1>=7049 && $1<=7295) || ($1>=7425 && $1<=8080) || ($1==8084) || ($1>=8352 && $1<=8425) || ($1>=8461 && $1<=8495) || ($1>=8542 && $1<=8549) || ($1>=8717 && $1<=8856) || ($1>=9327 && $1<=9428) || ($1>=9625 && $1<=9673) || ($1>=9790 && $1<=9791) || ($1>=9793 && $1<=9801) || ($1>=9812 && $1<=9849) || ($1>=10010 && $1<=10017) || ($1>=10065 && $1<=10114) || ($1>=10128 && $1<=10130) || ($1>=10137 && $1<=10197) || ($1>=10387 && $1<=10395) || ($1>=10403 && $1<=10485) || ($1>=10978 && $1<=10999) || ($1>=11176 && $1<=11325) || ($1>=11620 && $1<=11687) || ($1>=11816 && $1<=11843) || ($1>=11939 && $1<=12099) || ($1>=12116 && $1<=12208) || ($1>=12220 && $1<=12307) || ($1>=12357 && $1<=12488) || ($1>=12598 && $1<=12806) || ($1>=12948 && $1<=12972) || ($1>=13216 && $1<=13306) || ($1>=13312 && $1<=13366) || ($1>=13415 && $1<=14615) || ($1>=14628 && $1<=15722) || ($1>=15989 && $1<=16833) || ($1>=17402 && $1<=17431)){ print 9}
}' tmp

This work as a charm until an update hit me. If until now, my maximum integer value was 17431. Now that I got 50421 witch means that I must declare more intervals on my solution but after inserting all the new intervals in their places, the script stop working with the error:

AWK argument list too long 

Do you have any idea how to operate on such a large number of intervals?

I was considering the fallowing: Given the fact that I can create a map file like:

Target,Index
103,1
104,1
105,2
106,5
107,8
108,9
109,6
110,9
111,6
112,9
113,9
114,9
115,9
116,9
117,9
118,9
119,9
120,9

Where Target is my number to look for and the Index is the value that will replace my Target number. How can I import the first column into an Array1 and the second column into Array2 and for every line from tmp file check the target value position form Array1 and print the same potion of Array2

Example: If 1917 is on potion Array1[1853] then print Array1[1853] (the value from the same potion), given the fact that the 2 Arrays are equal in terms of the number of elements.

The main quiescence is: There is any way to fix my script to accept all the new 6000 intervals? If AWK can not support it this way, what do you recommend?

Upvotes: 1

Views: 371

Answers (4)

Dudi Boy
Dudi Boy

Reputation: 4890

I suggest another solution for this problem. It will be easier to maintain and use lookup table to prevent repeated comparisons.

This is the initial set up to the first 6 filters.

The user is advised to add the filters 6,7,8,9 and increase loop counter from 5 to 9.

script.awk

# initialize filter range pairs
# each filter is a pair of lower/upper ranges
BEGIN {
     filterArr[0] = "363,499,4645,4646";
     filterArr[1] = "2174,2193";
     filterArr[2] = "500,500,12308,12356";
     filterArr[3] = "103,220,252,299,1980,1986,2921,2921";
     filterArr[4] = "221,251,8085,8091,8350,8350,12809,12945,16834,17033";
     filterArr[5] = "300,362,522,522,2923,2925,3441,3442,4644,4644,5677,5695,8082,8083,8093,8349,12946,12947,12986,13215,13309,13311";
}

# for each input line from input file, loop through all the filters (call scanFilter utility funtion).
{
     for (filter = 0; filter <= 1; filter++) scanFilter(filter);
}

# utility function to scan input field $1 to be in ranges
function scanFilter(filterIdx) {
     # if input field alread found, return its value
     if ($1 in foundAlready) {
          print foundAlready[$1];
          return;
     }
     # need to scan the input field through all filters
     rangesCount = split(filterArr[filterIdx], rangesArr, ",");
     # rangesCount holds the range pair values (lower, upper)
     # rangesArr hold rangePairs: rangesArr[1]=1st range lower,rangesArr[2]=1st range upper,rangesArr[3]=2nd range lower,rangesArr[4]=2nd range upper, etc

     for (rangeCounter = 1; rangeCounter <= rangesCount; rangeCounter += 2) {
          # test $1 for each upper range and lower range
          if ($1 >= rangesArr[rangeCounter] && $1 <= rangesArr[rangeCounter + 1]) {
               print filterIdx;
               foundAlready[$1] = filterIdx;
               return;
          }
     }
}

Upvotes: 0

Ed Morton
Ed Morton

Reputation: 203995

Even if you do have to hard-code your ranges, do it like this so you can validate that you don't have any holes in your ranges or any cases where the same value could be present in multiple ranges and can handle the input with just a hash lookup rather than having to loop through the ranges for every line of input:

$ cat tst.awk
BEGIN {
    # r[] = ranges[], v = value
    v = 0
    r[363,499]          = v
    r[4645,4646]        = v

    v = 1
    r[2174,2193]        = v
    r[500]              = v
    r[12308,12356]      = v

    populate(r,map)
}

$1 in map { print map[$1] }

function populate(ranges,map,    cnt,range,begend,beg,end,val,n,i) {
    for (range in ranges) {
        n = split(range,begend,SUBSEP)
        beg = begend[1]
        end = begend[n]
        val = ranges[range]
        for (i=beg; i<=end; i++) {
            map[i] = val
            cnt[i]++
        }
        min = ((min == "") || (min > beg) ? beg : min)
        max = ((max == "") || (max < end) ? end : max)
    }

    for (i=min; i<=max; i++) {
        if ( cnt[i] != 1 ) {
            if ( cnt[i] == 0 ) {
                printf "Hole: %d\n", i | "cat>&2"
            }
            else {
                printf "Overlap: %d\n", i | "cat>&2"
            }
        }
    }
}

.

$ echo 2180 | awk -f  tst.awk 2>/dev/null
1

$ echo 370 | awk -f  tst.awk 2>/dev/null
0

I'm redirecting stderr to /dev/null above since I haven't populated r[] fully and so there will be hundreds of holes being reported.

Obviously you can trivially populate r[] from a file rather than hard-coding the values in your script if you prefer but since that first part of the BEGIN section is the only place you'd specify the ranges and they're trivial to write so unlikely to mess up, keeping the data in the script itself isn't so bad in this case.

Upvotes: 4

KamilCuk
KamilCuk

Reputation: 141493

First create a map file like: first column is the min value, the second column is max, the third column is the output. Let's name the file mapfile.txt:

 363  499 0
4645 4646 0
2174 2193 1
...

Then run awk such as (untested, typos expected):

awk 'FNR == NR { ++i; min[i]=$1; max[i]=$2; result[i]=$3; }
     FNR != NR { 
          for (j = 1; j <= i; ++j) { 
              if (min[j] <= $1 && $1 <= max[j]) {
                  print result[j];
                  break
              } 
          }
     }
' mapfile.txt second_column_values.txt

First we read the map file into memory and three arrays. Then we check the value for min/max and print the result if found. Then if the result is found - we break from the loop.

Alternatively, if you have work with huge files, you could do this:

  1. First create another mapfile such the first column is the value and the second is the result. It could be generated from the mapfile.txt above with something like while read a b c; do seq -f "%.0f $c" $a $b; done < mapfile.txt

363 0
364 0
365 0
...
  1. Remember to sort this file. Let's call it mapfile2.txt

  2. Then having numbers from column 2, add a line number on each line, sort it on the second column, then join it with the mapfile2.txt, re-sort on line numbers and remove line numbers.

nl -w1 second_column_values.txt | sort -s -k2 |
join -12 -21 - <(<mapfile2.txt sort -s -k1) |
sort -s -k1 | cut -f2-

Or do the same without numbering the lines if the order of lines does not matter. I think sort+join-ing the files could be faster then plain array lookup with range comparison in very extreme cases.

Upvotes: 3

Dudi Boy
Dudi Boy

Reputation: 4890

Here is a small change that will work. Create an awk script file and than run it.

script.awk

## Perform an AWK comparation between the integers intervals and the terget number.
##In the end write everything out in the new column .
{ 
if( ($1>=363 && $1<=499) || ($1>=4645 && $1<=4646)){ print 0}  
else if( ($1>=2174 && $1<=2193)) { print 1}  
else if( ($1==500) || ($1>=12308 && $1<=12356)){ print 2} 
else if( ($1>=103 && $1<=220) || ($1>=252 && $1<=299) || ($1>=1980 && $1<=1986) || ($1>=2921 && $1<=2922)){ print 3} 
else if( ($1>=221 && $1<=251) || ($1>=8085 && $1<=8091) || ($1==8350) || ($1>=12809 && $1<=12945) || ($1>=16834 && $1<=17033)){ print 4} 
else if( ($1>=300 && $1<=362) || ($1==522) || ($1>=2923 && $1<=2925) || ($1>=3441 && $1<=3442) || ($1==4644)|| ($1>=5677 && $1<=5695) || ($1>=8082 && $1<=8083)|| ($1>=8093 && $1<=8349) || ($1>=12946 && $1<=12947) || ($1>=12986 && $1<=13215) || ($1>=13309 && $1<=13311)){ print 5}
else if( ($1>=501 && $1<=504) || ($1>=566 && $1<=600) || ($1>=613 && $1<=637) ||  ($1>=2015 && $1<=2040) ||  ($1>=2103 && $1<=2126) || ($1>=2373 && $1<=2374) || ($1>=3828 && $1<=4125) || ($1>=4237 && $1<=4636) || ($1>=4647 && $1<=4889) || ($1>=4991 && $1<=5676) || ($1>=5696 && $1<=5705) || ($1>=6502 && $1<=6595) || ($1>=8429 && $1<=8460) || ($1>=8552 && $1<=8699) || ($1>=10487 && $1<=10977) || ($1>=11326 && $1<=11617) || ($1>=11688 && $1<=11815) || ($1>=11844 && $1<=11938) || ($1>=12490 && $1<=12597) || ($1>=12973 && $1<=12982) || ($1>=13367 && $1<=13414)){ print 6}
else if( ($1>=523 && $1<=548) || ($1>=555 && $1<=565) || ($1>=2005 && $1<=2014) || ($1>=2041 && $1<=2063) || ($1>=2091 && $1<=2102) ||  ($1==2394) || ($1>=2407 && $1<=2411) || ($1>=2926 && $1<=3008) || ($1>=3443 && $1<=3473) || ($1>=3486 && $1<=3813) || ($1>=4132 && $1<=4144) || ($1>=4637 && $1<=4643) || ($1>=4916 && $1<=4981) || ($1>=5711 && $1<=5741) || ($1>=6403 && $1<=6405) || ($1>=6415 && $1<=6466) || ($1>=6701 && $1<=7002) || ($1>=7035 && $1<=7048) || ($1>=8426 && $1<=8428) || ($1>=8496 && $1<=8541) || ($1>=8857 && $1<=9323) || ($1>=9429 && $1<=9618) || ($1>=9674 && $1<=9789) || ($1>=9802 && $1<=9811) || ($1>=9850 && $1<=10009) || ($1>=10131 && $1<=10136) || ($1>=10396 && $1<=10402) || ($1>=11000 && $1<=11175) || ($1==11618) || ($1>=12100 && $1<=12111) || ($1>=12212 && $1<=12219) || ($1==12489) || ($1>=12807 && $1<=12808) || ($1==12983) || ($1>=14616 && $1<=14627) || ($1>=15723 && $1<=15897)){ print 7}
else if( ($1==521) || ($1==554) || ($1>=601 && $1<=612) || ($1>=651 && $1<=708) || ($1>=1905 && $1<=1942) || ($1>=1949 && $1<=1979) || ($1>=1987 && $1<=1993) || ($1>=2259 && $1<=2278) || ($1>=2352 && $1<=2362) || ($1>=2395 && $1<=2406) || ($1>=2412 && $1<=2449) || ($1>=2673 && $1<=2919) || ($1>=3009 && $1<=3016) || ($1>=3814 && $1<=3827) || ($1>=4126 && $1<=4131) || ($1>=4982 && $1<=4990) || ($1>=5706 && $1<=5710) || ($1>=6012 && $1<=6181) || ($1>=6285 && $1<=6339) || ($1>=6409 && $1<=6411) || ($1>=6596 && $1<=6700) || ($1>=7191 && $1<=7424) || ($1==8081) || ($1>=8550 && $1<=8551) || ($1>=8700 && $1<=8716) || ($1>=9324 && $1<=9326) || ($1>=9619 && $1<=9624) || ($1==9729) || ($1>=10018 && $1<=10064) || ($1>=10115 && $1<=10126) || ($1>=10198 && $1<=10386) || ($1==10486) || ($1>=12112 && $1<=12115) || ($1>=12209 && $1<=12211)){ print 8}
else if( ($1>=489 && $1<=498) || ($1>=505 && $1<=520) || ($1>=549 && $1<=553) || ($1>=638 && $1<=650) || ($1>=709 && $1<=1904) || ($1>=1943 && $1<=1948) || ($1>=1994 && $1<=2004) || ($1>=2064 && $1<=2090) || ($1>=2127 && $1<=2173) || ($1>=2194 && $1<=2258) || ($1>=2279 && $1<=2351) || ($1>=2363 && $1<=2372) || ($1==2393) || ($1>=2450 && $1<=2672) || ($1>=3474 && $1<=3485) || ($1>=4145 && $1<=4236) || ($1>=4890 && $1<=4915) || ($1>=5742 && $1<=6011) || ($1>=7003 && $1<=7034) || ($1>=7049 && $1<=7295) || ($1>=7425 && $1<=8080) || ($1==8084) || ($1>=8352 && $1<=8425) || ($1>=8461 && $1<=8495) || ($1>=8542 && $1<=8549) || ($1>=8717 && $1<=8856) || ($1>=9327 && $1<=9428) || ($1>=9625 && $1<=9673) || ($1>=9790 && $1<=9791) || ($1>=9793 && $1<=9801) || ($1>=9812 && $1<=9849) || ($1>=10010 && $1<=10017) || ($1>=10065 && $1<=10114) || ($1>=10128 && $1<=10130) || ($1>=10137 && $1<=10197) || ($1>=10387 && $1<=10395) || ($1>=10403 && $1<=10485) || ($1>=10978 && $1<=10999) || ($1>=11176 && $1<=11325) || ($1>=11620 && $1<=11687) || ($1>=11816 && $1<=11843) || ($1>=11939 && $1<=12099) || ($1>=12116 && $1<=12208) || ($1>=12220 && $1<=12307) || ($1>=12357 && $1<=12488) || ($1>=12598 && $1<=12806) || ($1>=12948 && $1<=12972) || ($1>=13216 && $1<=13306) || ($1>=13312 && $1<=13366) || ($1>=13415 && $1<=14615) || ($1>=14628 && $1<=15722) || ($1>=15989 && $1<=16833) || ($1>=17402 && $1<=17431)){ print 9}
}

running the script:

awk -f script.awk input.csv

Upvotes: 1

Related Questions