Reputation: 109
I must process a CSV file, and I need to repace some values from a column with a specific index number. Example
10/06-14:04:21.082467 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:22.082715 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:23.082940 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:24.083256 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:27.421793 ,1418,64878,192.168.0.13,161,0xC498BF38,0x0,,
10/06-14:04:27.522099 ,1418,,64879,192.168.0.13,161,0xC499BF39,0x0,,
10/06-14:04:33.445012 ,1421,64878,192.168.0.13,705,0xC498BF38,0x0,,
10/06-14:04:33.545144 ,1421,192.168.0.130xC498BF38,0x0,,
The column of interest is Column number 2. I extract it into a separate tmp file and I looks like this:
1917
1917
1917
1917
1418
1418
1421
1421
My expected output is:
5
5
5
5
2
2
0
0
Until now I use the script below:
csvtool col 2 $file>tmp && echo " TEMPORARY Column DONE "&& echo "Start the Magic"
## Perform an AWK comparation between the integers intervals and the terget number.
##In the end write everything out in the new column .
awk '{
if( ($1>=363 && $1<=499) || ($1>=4645 && $1<=4646)){ print 0}
else if( ($1>=2174 && $1<=2193)) { print 1}
else if( ($1==500) || ($1>=12308 && $1<=12356)){ print 2}
else if( ($1>=103 && $1<=220) || ($1>=252 && $1<=299) || ($1>=1980 && $1<=1986) || ($1>=2921 && $1<=2922)){ print 3}
else if( ($1>=221 && $1<=251) || ($1>=8085 && $1<=8091) || ($1==8350) || ($1>=12809 && $1<=12945) || ($1>=16834 && $1<=17033)){ print 4}
else if( ($1>=300 && $1<=362) || ($1==522) || ($1>=2923 && $1<=2925) || ($1>=3441 && $1<=3442) || ($1==4644)|| ($1>=5677 && $1<=5695) || ($1>=8082 && $1<=8083)|| ($1>=8093 && $1<=8349) || ($1>=12946 && $1<=12947) || ($1>=12986 && $1<=13215) || ($1>=13309 && $1<=13311)){ print 5}
else if( ($1>=501 && $1<=504) || ($1>=566 && $1<=600) || ($1>=613 && $1<=637) || ($1>=2015 && $1<=2040) || ($1>=2103 && $1<=2126) || ($1>=2373 && $1<=2374) || ($1>=3828 && $1<=4125) || ($1>=4237 && $1<=4636) || ($1>=4647 && $1<=4889) || ($1>=4991 && $1<=5676) || ($1>=5696 && $1<=5705) || ($1>=6502 && $1<=6595) || ($1>=8429 && $1<=8460) || ($1>=8552 && $1<=8699) || ($1>=10487 && $1<=10977) || ($1>=11326 && $1<=11617) || ($1>=11688 && $1<=11815) || ($1>=11844 && $1<=11938) || ($1>=12490 && $1<=12597) || ($1>=12973 && $1<=12982) || ($1>=13367 && $1<=13414)){ print 6}
else if( ($1>=523 && $1<=548) || ($1>=555 && $1<=565) || ($1>=2005 && $1<=2014) || ($1>=2041 && $1<=2063) || ($1>=2091 && $1<=2102) || ($1==2394) || ($1>=2407 && $1<=2411) || ($1>=2926 && $1<=3008) || ($1>=3443 && $1<=3473) || ($1>=3486 && $1<=3813) || ($1>=4132 && $1<=4144) || ($1>=4637 && $1<=4643) || ($1>=4916 && $1<=4981) || ($1>=5711 && $1<=5741) || ($1>=6403 && $1<=6405) || ($1>=6415 && $1<=6466) || ($1>=6701 && $1<=7002) || ($1>=7035 && $1<=7048) || ($1>=8426 && $1<=8428) || ($1>=8496 && $1<=8541) || ($1>=8857 && $1<=9323) || ($1>=9429 && $1<=9618) || ($1>=9674 && $1<=9789) || ($1>=9802 && $1<=9811) || ($1>=9850 && $1<=10009) || ($1>=10131 && $1<=10136) || ($1>=10396 && $1<=10402) || ($1>=11000 && $1<=11175) || ($1==11618) || ($1>=12100 && $1<=12111) || ($1>=12212 && $1<=12219) || ($1==12489) || ($1>=12807 && $1<=12808) || ($1==12983) || ($1>=14616 && $1<=14627) || ($1>=15723 && $1<=15897)){ print 7}
else if( ($1==521) || ($1==554) || ($1>=601 && $1<=612) || ($1>=651 && $1<=708) || ($1>=1905 && $1<=1942) || ($1>=1949 && $1<=1979) || ($1>=1987 && $1<=1993) || ($1>=2259 && $1<=2278) || ($1>=2352 && $1<=2362) || ($1>=2395 && $1<=2406) || ($1>=2412 && $1<=2449) || ($1>=2673 && $1<=2919) || ($1>=3009 && $1<=3016) || ($1>=3814 && $1<=3827) || ($1>=4126 && $1<=4131) || ($1>=4982 && $1<=4990) || ($1>=5706 && $1<=5710) || ($1>=6012 && $1<=6181) || ($1>=6285 && $1<=6339) || ($1>=6409 && $1<=6411) || ($1>=6596 && $1<=6700) || ($1>=7191 && $1<=7424) || ($1==8081) || ($1>=8550 && $1<=8551) || ($1>=8700 && $1<=8716) || ($1>=9324 && $1<=9326) || ($1>=9619 && $1<=9624) || ($1==9729) || ($1>=10018 && $1<=10064) || ($1>=10115 && $1<=10126) || ($1>=10198 && $1<=10386) || ($1==10486) || ($1>=12112 && $1<=12115) || ($1>=12209 && $1<=12211)){ print 8}
else if( ($1>=489 && $1<=498) || ($1>=505 && $1<=520) || ($1>=549 && $1<=553) || ($1>=638 && $1<=650) || ($1>=709 && $1<=1904) || ($1>=1943 && $1<=1948) || ($1>=1994 && $1<=2004) || ($1>=2064 && $1<=2090) || ($1>=2127 && $1<=2173) || ($1>=2194 && $1<=2258) || ($1>=2279 && $1<=2351) || ($1>=2363 && $1<=2372) || ($1==2393) || ($1>=2450 && $1<=2672) || ($1>=3474 && $1<=3485) || ($1>=4145 && $1<=4236) || ($1>=4890 && $1<=4915) || ($1>=5742 && $1<=6011) || ($1>=7003 && $1<=7034) || ($1>=7049 && $1<=7295) || ($1>=7425 && $1<=8080) || ($1==8084) || ($1>=8352 && $1<=8425) || ($1>=8461 && $1<=8495) || ($1>=8542 && $1<=8549) || ($1>=8717 && $1<=8856) || ($1>=9327 && $1<=9428) || ($1>=9625 && $1<=9673) || ($1>=9790 && $1<=9791) || ($1>=9793 && $1<=9801) || ($1>=9812 && $1<=9849) || ($1>=10010 && $1<=10017) || ($1>=10065 && $1<=10114) || ($1>=10128 && $1<=10130) || ($1>=10137 && $1<=10197) || ($1>=10387 && $1<=10395) || ($1>=10403 && $1<=10485) || ($1>=10978 && $1<=10999) || ($1>=11176 && $1<=11325) || ($1>=11620 && $1<=11687) || ($1>=11816 && $1<=11843) || ($1>=11939 && $1<=12099) || ($1>=12116 && $1<=12208) || ($1>=12220 && $1<=12307) || ($1>=12357 && $1<=12488) || ($1>=12598 && $1<=12806) || ($1>=12948 && $1<=12972) || ($1>=13216 && $1<=13306) || ($1>=13312 && $1<=13366) || ($1>=13415 && $1<=14615) || ($1>=14628 && $1<=15722) || ($1>=15989 && $1<=16833) || ($1>=17402 && $1<=17431)){ print 9}
}' tmp
This work as a charm until an update hit me. If until now, my maximum integer value was 17431. Now that I got 50421 witch means that I must declare more intervals on my solution but after inserting all the new intervals in their places, the script stop working with the error:
AWK argument list too long
Do you have any idea how to operate on such a large number of intervals?
I was considering the fallowing: Given the fact that I can create a map file like:
Target,Index
103,1
104,1
105,2
106,5
107,8
108,9
109,6
110,9
111,6
112,9
113,9
114,9
115,9
116,9
117,9
118,9
119,9
120,9
Where Target is my number to look for and the Index is the value that will replace my Target number.
How can I import the first column into an Array1 and the second column into Array2 and for every line from tmp
file check the target value position form Array1 and print the same potion of Array2
Example: If 1917 is on potion Array1[1853] then print Array1[1853] (the value from the same potion), given the fact that the 2 Arrays are equal in terms of the number of elements.
The main quiescence is: There is any way to fix my script to accept all the new 6000 intervals? If AWK can not support it this way, what do you recommend?
Upvotes: 1
Views: 371
Reputation: 4890
I suggest another solution for this problem. It will be easier to maintain and use lookup table to prevent repeated comparisons.
This is the initial set up to the first 6 filters.
The user is advised to add the filters 6,7,8,9 and increase loop counter from 5 to 9.
script.awk
# initialize filter range pairs
# each filter is a pair of lower/upper ranges
BEGIN {
filterArr[0] = "363,499,4645,4646";
filterArr[1] = "2174,2193";
filterArr[2] = "500,500,12308,12356";
filterArr[3] = "103,220,252,299,1980,1986,2921,2921";
filterArr[4] = "221,251,8085,8091,8350,8350,12809,12945,16834,17033";
filterArr[5] = "300,362,522,522,2923,2925,3441,3442,4644,4644,5677,5695,8082,8083,8093,8349,12946,12947,12986,13215,13309,13311";
}
# for each input line from input file, loop through all the filters (call scanFilter utility funtion).
{
for (filter = 0; filter <= 1; filter++) scanFilter(filter);
}
# utility function to scan input field $1 to be in ranges
function scanFilter(filterIdx) {
# if input field alread found, return its value
if ($1 in foundAlready) {
print foundAlready[$1];
return;
}
# need to scan the input field through all filters
rangesCount = split(filterArr[filterIdx], rangesArr, ",");
# rangesCount holds the range pair values (lower, upper)
# rangesArr hold rangePairs: rangesArr[1]=1st range lower,rangesArr[2]=1st range upper,rangesArr[3]=2nd range lower,rangesArr[4]=2nd range upper, etc
for (rangeCounter = 1; rangeCounter <= rangesCount; rangeCounter += 2) {
# test $1 for each upper range and lower range
if ($1 >= rangesArr[rangeCounter] && $1 <= rangesArr[rangeCounter + 1]) {
print filterIdx;
foundAlready[$1] = filterIdx;
return;
}
}
}
Upvotes: 0
Reputation: 203995
Even if you do have to hard-code your ranges, do it like this so you can validate that you don't have any holes in your ranges or any cases where the same value could be present in multiple ranges and can handle the input with just a hash lookup rather than having to loop through the ranges for every line of input:
$ cat tst.awk
BEGIN {
# r[] = ranges[], v = value
v = 0
r[363,499] = v
r[4645,4646] = v
v = 1
r[2174,2193] = v
r[500] = v
r[12308,12356] = v
populate(r,map)
}
$1 in map { print map[$1] }
function populate(ranges,map, cnt,range,begend,beg,end,val,n,i) {
for (range in ranges) {
n = split(range,begend,SUBSEP)
beg = begend[1]
end = begend[n]
val = ranges[range]
for (i=beg; i<=end; i++) {
map[i] = val
cnt[i]++
}
min = ((min == "") || (min > beg) ? beg : min)
max = ((max == "") || (max < end) ? end : max)
}
for (i=min; i<=max; i++) {
if ( cnt[i] != 1 ) {
if ( cnt[i] == 0 ) {
printf "Hole: %d\n", i | "cat>&2"
}
else {
printf "Overlap: %d\n", i | "cat>&2"
}
}
}
}
.
$ echo 2180 | awk -f tst.awk 2>/dev/null
1
$ echo 370 | awk -f tst.awk 2>/dev/null
0
I'm redirecting stderr to /dev/null above since I haven't populated r[]
fully and so there will be hundreds of holes being reported.
Obviously you can trivially populate r[]
from a file rather than hard-coding the values in your script if you prefer but since that first part of the BEGIN section is the only place you'd specify the ranges and they're trivial to write so unlikely to mess up, keeping the data in the script itself isn't so bad in this case.
Upvotes: 4
Reputation: 141493
First create a map file like: first column is the min value, the second column is max, the third column is the output. Let's name the file mapfile.txt
:
363 499 0
4645 4646 0
2174 2193 1
...
Then run awk such as (untested, typos expected):
awk 'FNR == NR { ++i; min[i]=$1; max[i]=$2; result[i]=$3; }
FNR != NR {
for (j = 1; j <= i; ++j) {
if (min[j] <= $1 && $1 <= max[j]) {
print result[j];
break
}
}
}
' mapfile.txt second_column_values.txt
First we read the map file into memory and three arrays. Then we check the value for min/max and print the result if found. Then if the result is found - we break from the loop.
Alternatively, if you have work with huge files, you could do this:
while read a b c; do seq -f "%.0f $c" $a $b; done < mapfile.txt
363 0
364 0
365 0
...
Remember to sort this file. Let's call it mapfile2.txt
Then having numbers from column 2, add a line number on each line, sort it on the second column, then join
it with the mapfile2.txt
, re-sort on line numbers and remove line numbers.
nl -w1 second_column_values.txt | sort -s -k2 |
join -12 -21 - <(<mapfile2.txt sort -s -k1) |
sort -s -k1 | cut -f2-
Or do the same without numbering the lines if the order of lines does not matter. I think sort
+join
-ing the files could be faster then plain array lookup with range comparison in very extreme cases.
Upvotes: 3
Reputation: 4890
Here is a small change that will work.
Create an awk
script file and than run it.
script.awk
## Perform an AWK comparation between the integers intervals and the terget number.
##In the end write everything out in the new column .
{
if( ($1>=363 && $1<=499) || ($1>=4645 && $1<=4646)){ print 0}
else if( ($1>=2174 && $1<=2193)) { print 1}
else if( ($1==500) || ($1>=12308 && $1<=12356)){ print 2}
else if( ($1>=103 && $1<=220) || ($1>=252 && $1<=299) || ($1>=1980 && $1<=1986) || ($1>=2921 && $1<=2922)){ print 3}
else if( ($1>=221 && $1<=251) || ($1>=8085 && $1<=8091) || ($1==8350) || ($1>=12809 && $1<=12945) || ($1>=16834 && $1<=17033)){ print 4}
else if( ($1>=300 && $1<=362) || ($1==522) || ($1>=2923 && $1<=2925) || ($1>=3441 && $1<=3442) || ($1==4644)|| ($1>=5677 && $1<=5695) || ($1>=8082 && $1<=8083)|| ($1>=8093 && $1<=8349) || ($1>=12946 && $1<=12947) || ($1>=12986 && $1<=13215) || ($1>=13309 && $1<=13311)){ print 5}
else if( ($1>=501 && $1<=504) || ($1>=566 && $1<=600) || ($1>=613 && $1<=637) || ($1>=2015 && $1<=2040) || ($1>=2103 && $1<=2126) || ($1>=2373 && $1<=2374) || ($1>=3828 && $1<=4125) || ($1>=4237 && $1<=4636) || ($1>=4647 && $1<=4889) || ($1>=4991 && $1<=5676) || ($1>=5696 && $1<=5705) || ($1>=6502 && $1<=6595) || ($1>=8429 && $1<=8460) || ($1>=8552 && $1<=8699) || ($1>=10487 && $1<=10977) || ($1>=11326 && $1<=11617) || ($1>=11688 && $1<=11815) || ($1>=11844 && $1<=11938) || ($1>=12490 && $1<=12597) || ($1>=12973 && $1<=12982) || ($1>=13367 && $1<=13414)){ print 6}
else if( ($1>=523 && $1<=548) || ($1>=555 && $1<=565) || ($1>=2005 && $1<=2014) || ($1>=2041 && $1<=2063) || ($1>=2091 && $1<=2102) || ($1==2394) || ($1>=2407 && $1<=2411) || ($1>=2926 && $1<=3008) || ($1>=3443 && $1<=3473) || ($1>=3486 && $1<=3813) || ($1>=4132 && $1<=4144) || ($1>=4637 && $1<=4643) || ($1>=4916 && $1<=4981) || ($1>=5711 && $1<=5741) || ($1>=6403 && $1<=6405) || ($1>=6415 && $1<=6466) || ($1>=6701 && $1<=7002) || ($1>=7035 && $1<=7048) || ($1>=8426 && $1<=8428) || ($1>=8496 && $1<=8541) || ($1>=8857 && $1<=9323) || ($1>=9429 && $1<=9618) || ($1>=9674 && $1<=9789) || ($1>=9802 && $1<=9811) || ($1>=9850 && $1<=10009) || ($1>=10131 && $1<=10136) || ($1>=10396 && $1<=10402) || ($1>=11000 && $1<=11175) || ($1==11618) || ($1>=12100 && $1<=12111) || ($1>=12212 && $1<=12219) || ($1==12489) || ($1>=12807 && $1<=12808) || ($1==12983) || ($1>=14616 && $1<=14627) || ($1>=15723 && $1<=15897)){ print 7}
else if( ($1==521) || ($1==554) || ($1>=601 && $1<=612) || ($1>=651 && $1<=708) || ($1>=1905 && $1<=1942) || ($1>=1949 && $1<=1979) || ($1>=1987 && $1<=1993) || ($1>=2259 && $1<=2278) || ($1>=2352 && $1<=2362) || ($1>=2395 && $1<=2406) || ($1>=2412 && $1<=2449) || ($1>=2673 && $1<=2919) || ($1>=3009 && $1<=3016) || ($1>=3814 && $1<=3827) || ($1>=4126 && $1<=4131) || ($1>=4982 && $1<=4990) || ($1>=5706 && $1<=5710) || ($1>=6012 && $1<=6181) || ($1>=6285 && $1<=6339) || ($1>=6409 && $1<=6411) || ($1>=6596 && $1<=6700) || ($1>=7191 && $1<=7424) || ($1==8081) || ($1>=8550 && $1<=8551) || ($1>=8700 && $1<=8716) || ($1>=9324 && $1<=9326) || ($1>=9619 && $1<=9624) || ($1==9729) || ($1>=10018 && $1<=10064) || ($1>=10115 && $1<=10126) || ($1>=10198 && $1<=10386) || ($1==10486) || ($1>=12112 && $1<=12115) || ($1>=12209 && $1<=12211)){ print 8}
else if( ($1>=489 && $1<=498) || ($1>=505 && $1<=520) || ($1>=549 && $1<=553) || ($1>=638 && $1<=650) || ($1>=709 && $1<=1904) || ($1>=1943 && $1<=1948) || ($1>=1994 && $1<=2004) || ($1>=2064 && $1<=2090) || ($1>=2127 && $1<=2173) || ($1>=2194 && $1<=2258) || ($1>=2279 && $1<=2351) || ($1>=2363 && $1<=2372) || ($1==2393) || ($1>=2450 && $1<=2672) || ($1>=3474 && $1<=3485) || ($1>=4145 && $1<=4236) || ($1>=4890 && $1<=4915) || ($1>=5742 && $1<=6011) || ($1>=7003 && $1<=7034) || ($1>=7049 && $1<=7295) || ($1>=7425 && $1<=8080) || ($1==8084) || ($1>=8352 && $1<=8425) || ($1>=8461 && $1<=8495) || ($1>=8542 && $1<=8549) || ($1>=8717 && $1<=8856) || ($1>=9327 && $1<=9428) || ($1>=9625 && $1<=9673) || ($1>=9790 && $1<=9791) || ($1>=9793 && $1<=9801) || ($1>=9812 && $1<=9849) || ($1>=10010 && $1<=10017) || ($1>=10065 && $1<=10114) || ($1>=10128 && $1<=10130) || ($1>=10137 && $1<=10197) || ($1>=10387 && $1<=10395) || ($1>=10403 && $1<=10485) || ($1>=10978 && $1<=10999) || ($1>=11176 && $1<=11325) || ($1>=11620 && $1<=11687) || ($1>=11816 && $1<=11843) || ($1>=11939 && $1<=12099) || ($1>=12116 && $1<=12208) || ($1>=12220 && $1<=12307) || ($1>=12357 && $1<=12488) || ($1>=12598 && $1<=12806) || ($1>=12948 && $1<=12972) || ($1>=13216 && $1<=13306) || ($1>=13312 && $1<=13366) || ($1>=13415 && $1<=14615) || ($1>=14628 && $1<=15722) || ($1>=15989 && $1<=16833) || ($1>=17402 && $1<=17431)){ print 9}
}
running the script:
awk -f script.awk input.csv
Upvotes: 1