Reputation: 3178
I have two files and i want to divide file1 into number of parts which are equals to number of lines in file2.
Here each line from file2 will be compared to file1 and all the matched records from file1 are formed as new file.
file1
<AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974096" STATUS="0" SQLTEXT="SHOW COLLATION"/>
<AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974099" STATUS="0" SQLTEXT="SHOW TABLES"/>
<AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974095" STATUS="0" SQLTEXT="SHOW COLLATION"/>
<AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974094" STATUS="0" SQLTEXT="SHOW COLLATION"/>
<AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974099" STATUS="0" SQLTEXT="SHOW COLLATION"/>
<AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974094" STATUS="0" SQLTEXT="SET NAMES utf8"/>
file2
21974096
21974100
21974095
21974094
21974099
required output:
21974094.txt
==========================================
TIMESTAMP="2013-08-26T19:31:17"
SQLTEXT="SET NAMES utf8"
==========================================
TIMESTAMP="2013-08-26T19:31:17"
SQLTEXT="SHOW COLLATION"
21974099.txt
==========================================
TIMESTAMP="2013-08-26T19:31:17"
SQLTEXT="SHOW COLLATION"
==========================================
TIMESTAMP="2013-08-26T19:31:17"
SQLTEXT="SHOW TABLES"
21974095.txt
==========================================
TIMESTAMP="2013-08-26T19:31:17"
SQLTEXT="SHOW COLLATION"
...
To achieve this with awk i have written a code like below but it is not satisfying my requirement.
awk 'NR==FNR{a[$4];next}!($4 in a){ print $2 "\n" $6 "\n=========\n" > $4 ".txt"}' file2 file1
Can any one please help me how to achieve the above requirement with awk or any other shell command. (It has to generate minimum of 10000 files and it should complete generating files in max 10 minutes.)
If i have executed below command, it is almost reached but not completely satisfying the requirement.
awk 'NR==FNR{a[$1];next}{split($4,b,"\"")}(b[2] in a){print $2"\n"$10"\n=========\n" > b[2]".txt"}' file2 file1
OUTPUT from one of the file
TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="/*
=========
TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="/*
=========
TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="SHOW
But i want OUTPUT like below
TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="/*show variables"
=========
TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="/* select * from table "
=========
TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="SHOW collations "
Means here i need to specify delimiter for only file1 as '"' and no delimiters for file2..
Can any one help regarding this ?
Upvotes: 1
Views: 520
Reputation: 75618
This one's my solution for that:
#!/usr/bin/gawk -f
BEGIN {
f = ARGV[2]
while (getline id < f) {
ids[id] = 0
}
ARGV[2] = ""
}
match($0, /.*<AUDIT_RECORD.* (TIMESTAMP="[^"]*").* CONNECTION_ID="([^"]*)".* (SQLTEXT="[^"]*").*/, a) {
id = a[2]
if (id in ids) {
key = id "|" ids[id]++
timestamps[key] = a[1]
sqltexts[key] = a[3]
}
}
END {
for (id in ids) {
count = ids[id]
if (count) {
file = id ".txt"
for (i = 0; i < count; ++i) {
key = id "|" i
printf "%s\n%s\n%s\n\n", "==========================================", timestamps[key], sqltexts[key] > file
}
close(file)
}
}
}
Run it with
gawk -f script.awk file1 file2
I actually prefer to preprocess the second file before entering the loop for the first file since I don't like adding unecessary condition checking.
You can also just modify printf
if it's not the exact output you wanted.
Among other solutions actually, gawk's match could allow one to be specific on which parameters should really be targeted so the above code could also work if other lines would have slightly different format like other extra key/value pairs around.
UPDATE
This one's simpler but risks opening too many output files at once depending on the number of entries.
#!/usr/bin/gawk -f
BEGIN {
f = ARGV[2]
while (getline id < f) {
ids[id] = 0
}
ARGV[2] = ""
}
match($0, /.*<AUDIT_RECORD.* (TIMESTAMP="[^"]*").* CONNECTION_ID="([^"]*)".* (SQLTEXT="[^"]*").*/, a) {
id = a[2]
if (id in ids) {
printf "%s\n%s\n%s\n\n", "==========================================", a[1], a[3] > id ".txt"
}
}
Upvotes: 2
Reputation: 33387
You should use the split function to separate the fourth word by the "
character, so that you get the number that matches the values in file2. You should also a[$1]
in the firth block.
This script should work:
awk 'NR==FNR{a[$1];next}{split($4,b,"\"")}(b[2] in a){print $2"\n"$6"\n=========\n" > b[2]".txt"}' file2 file1
UPDATE:
As long as there are not extra quotes in file1, we can use the "
character as a field separator:
awk -F\" 'NR==FNR{a[$1];next}($6 in a){print "TIMESTAMP=\""$2"\"\nSQLTEXT=\""$10"\"\n=========\n" > $6".txt"}' file2 file1
We split the input file with the "
character as a delimiter, so field $2 is the timestamp, field $6 is the conn. id, and filed $10 is the SQLTEXT.
The first block NR==FNR{a[$1];next}
populates the array with the coonection ids from file2 ( NR == FNR for the first file). With ($6 in a)
we filter the lines of the second file (since we called next in the first block), for which the connection id is an index of table a
. If we find a match, then the block {print "TIMESTAMP=\""$2"\"\nSQLTEXT=\""$10"\"\n=========\n" > $6".txt"}'
is executed, which prints the relevant information to the file conn_id.txt
Upvotes: 3
Reputation: 58324
This doesn't use awk
but it works:
while read -r n
do
echo "Generating $n.txt"
grep $n file1 | sed 's/^.*\(TIMESTAMP="[^"]\+"\).*\(SQLTEXT="[^"]\+"\).*$/=======================\n\1\n\2\n/' > $n.txt
# If you don't want an empty file when there's no match, add this line
if [ ! -s $n.txt ]; then rm -f $n.txt ; fi
done < file2
Upvotes: 2