Reputation: 3178

Split a file based on the string from the other file

I have two files and i want to divide file1 into number of parts which are equals to number of lines in file2.

Here each line from file2 will be compared to file1 and all the matched records from file1 are formed as new file.

file1

  <AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974096" STATUS="0" SQLTEXT="SHOW COLLATION"/> 
  <AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974099" STATUS="0" SQLTEXT="SHOW TABLES"/> 
  <AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974095" STATUS="0" SQLTEXT="SHOW COLLATION"/> 
  <AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974094" STATUS="0" SQLTEXT="SHOW COLLATION"/> 
  <AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974099" STATUS="0" SQLTEXT="SHOW COLLATION"/> 
  <AUDIT_RECORD TIMESTAMP="2013-08-26T19:31:17" NAME="Query" CONNECTION_ID="21974094" STATUS="0" SQLTEXT="SET NAMES utf8"/>

file2

required output:

21974094.txt

==========================================
TIMESTAMP="2013-08-26T19:31:17"
SQLTEXT="SET NAMES utf8"

==========================================
TIMESTAMP="2013-08-26T19:31:17"
SQLTEXT="SHOW COLLATION"


21974099.txt

==========================================
TIMESTAMP="2013-08-26T19:31:17"
SQLTEXT="SHOW COLLATION"

==========================================
TIMESTAMP="2013-08-26T19:31:17"
SQLTEXT="SHOW TABLES"


21974095.txt

==========================================
TIMESTAMP="2013-08-26T19:31:17"
SQLTEXT="SHOW COLLATION"

...

To achieve this with awk i have written a code like below but it is not satisfying my requirement.

awk 'NR==FNR{a[$4];next}!($4 in a){ print $2 "\n" $6 "\n=========\n" > $4 ".txt"}' file2  file1

Can any one please help me how to achieve the above requirement with awk or any other shell command. (It has to generate minimum of 10000 files and it should complete generating files in max 10 minutes.)

1st TRY

If i have executed below command, it is almost reached but not completely satisfying the requirement.

awk 'NR==FNR{a[$1];next}{split($4,b,"\"")}(b[2] in a){print $2"\n"$10"\n=========\n" > b[2]".txt"}' file2 file1

OUTPUT from one of the file

TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="/*
=========

TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="/*
=========

TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="SHOW

But i want OUTPUT like below

TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="/*show variables"
=========

TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="/* select * from table "
=========

TIMESTAMP="2013-08-26T19:57:34"
SQLTEXT="SHOW collations "

Means here i need to specify delimiter for only file1 as '"' and no delimiters for file2..

Can any one help regarding this ?

Upvotes: 1

Answers (3)

konsolebox

Reputation: 75618

This one's my solution for that:

#!/usr/bin/gawk -f

BEGIN {
    f = ARGV[2]
    while (getline id < f) {
        ids[id] = 0
    }
    ARGV[2] = ""
}

match($0, /.*<AUDIT_RECORD.* (TIMESTAMP="[^"]*").* CONNECTION_ID="([^"]*)".* (SQLTEXT="[^"]*").*/, a) {
    id = a[2]
    if (id in ids) {
        key = id "|" ids[id]++
        timestamps[key] = a[1]
        sqltexts[key] = a[3]
    }
}

END {
    for (id in ids) {
        count = ids[id]
        if (count) {
            file = id ".txt"
            for (i = 0; i < count; ++i) {
                key = id "|" i
                printf "%s\n%s\n%s\n\n", "==========================================", timestamps[key], sqltexts[key] > file
            }
            close(file)
        }
    }
}

Run it with

gawk -f script.awk file1 file2

I actually prefer to preprocess the second file before entering the loop for the first file since I don't like adding unecessary condition checking.

You can also just modify printf if it's not the exact output you wanted.

Among other solutions actually, gawk's match could allow one to be specific on which parameters should really be targeted so the above code could also work if other lines would have slightly different format like other extra key/value pairs around.

UPDATE

This one's simpler but risks opening too many output files at once depending on the number of entries.

#!/usr/bin/gawk -f

BEGIN {
    f = ARGV[2]
    while (getline id < f) {
        ids[id] = 0
    }
    ARGV[2] = ""
}

match($0, /.*<AUDIT_RECORD.* (TIMESTAMP="[^"]*").* CONNECTION_ID="([^"]*)".* (SQLTEXT="[^"]*").*/, a) {
    id = a[2]
    if (id in ids) {
        printf "%s\n%s\n%s\n\n", "==========================================", a[1], a[3] > id ".txt"
    }
}

Upvotes: 2

user000001

Reputation: 33387

You should use the split function to separate the fourth word by the " character, so that you get the number that matches the values in file2. You should also a[$1] in the firth block.

This script should work:

awk 'NR==FNR{a[$1];next}{split($4,b,"\"")}(b[2] in a){print $2"\n"$6"\n=========\n" > b[2]".txt"}' file2  file1

UPDATE:

As long as there are not extra quotes in file1, we can use the " character as a field separator:

awk -F\" 'NR==FNR{a[$1];next}($6 in a){print "TIMESTAMP=\""$2"\"\nSQLTEXT=\""$10"\"\n=========\n" > $6".txt"}' file2  file1

We split the input file with the " character as a delimiter, so field $2 is the timestamp, field $6 is the conn. id, and filed $10 is the SQLTEXT.

The first block NR==FNR{a[$1];next} populates the array with the coonection ids from file2 ( NR == FNR for the first file). With ($6 in a) we filter the lines of the second file (since we called next in the first block), for which the connection id is an index of table a. If we find a match, then the block {print "TIMESTAMP=\""$2"\"\nSQLTEXT=\""$10"\"\n=========\n" > $6".txt"}' is executed, which prints the relevant information to the file conn_id.txt

Upvotes: 3

lurker

Reputation: 58324

This doesn't use awk but it works:

while read -r n
do
    echo "Generating $n.txt"
    grep $n file1 | sed 's/^.*\(TIMESTAMP="[^"]\+"\).*\(SQLTEXT="[^"]\+"\).*$/=======================\n\1\n\2\n/' > $n.txt

    # If you don't want an empty file when there's no match, add this line
    if [ ! -s $n.txt ]; then rm -f $n.txt ; fi
done < file2

Upvotes: 2

Split a file based on the string from the other file

1st TRY

Answers (3)

Related Questions