Reputation: 1336
I have two files that look like this. Both files are sorted on the 1st then 2nd field. (There can be multiple rows for an ID)
file a
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337313|cb1c00eeccb25ea5a069da63a1b0c2565379ff9c|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
3337320|1e3126f41f848509efad0b3415b003704377778c|1
file b
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337315|780055f13efffcb4bee115c6cf546af85ac6c0a7|1
3337316|19535297b9913b6bca1796b68505498d5e81b5ed|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
The first field is the key; 3 fields, pipe-separated. Each file is roughly 1 GB.
What I'd like to do is get back a result set that looks like:
3333 rows in File A
4444 rows in File B
1234 rows are identical
2345 rows are different (aka the 2nd/3rd field are different but the key matches)
111 rows in File A not in File B
222 rows in File B not in File A
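(Whatever tool produces these six numbers, the categories partition each file, so the counts must add up: rows in A = identical + different + A-only, and likewise for B. A quick shell illustration with hypothetical placeholder values:)

```shell
# hypothetical counts, for illustration only (not derived from the real files)
rows_a=5000; identical=3000; different=1500; only_a=500
# every row of file A is exactly one of: identical, key-match-but-different, or A-only
if [ $((identical + different + only_a)) -eq "$rows_a" ]; then
    echo "file A counts are consistent"
fi
# prints: file A counts are consistent
```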
Here's SQL code that accomplishes it, and that's my fallback.
--CREATE TABLE aws_hash_compare (the_filename VARCHAR(100) NOT NULL, switch_id BIGINT, hash_value CHAR(40),the_count TINYINT)
--CREATE UNIQUE CLUSTERED INDEX ucidx__awshashcompare__the_filename__switch_id ON aws_hash_compare(the_filename, switch_id)
DECLARE @mSsql_filename sysname = 'FileA'
DECLARE @mYsql_filename sysname = 'FileB'
SELECT COUNT(*) AS MSSQL FROM aws_hash_compare
WHERE the_filename = @mSsql_filename
SELECT COUNT(*) AS MYSQL FROM aws_hash_compare
WHERE the_filename = @mYsql_filename
SELECT COUNT(*) AS switch_id_match FROM aws_hash_compare mysql
INNER JOIN aws_hash_compare mssql
ON mysql.the_filename = @mYsql_filename
AND mssql.the_filename = @mSsql_filename
AND mysql.switch_id = mssql.switch_id
SELECT COUNT(*) AS complete_match FROM aws_hash_compare mysql
INNER JOIN aws_hash_compare mssql
ON mysql.the_filename = @mYsql_filename
AND mssql.the_filename = @mSsql_filename
AND mysql.switch_id = mssql.switch_id
AND mssql.hash_value = mysql.hash_value
AND mssql.the_count = mysql.the_count
SELECT COUNT(*) AS hash_differences FROM aws_hash_compare mysql
INNER JOIN aws_hash_compare mssql
ON mysql.the_filename = @mYsql_filename
AND mssql.the_filename = @mSsql_filename
AND mysql.switch_id = mssql.switch_id
AND (mssql.hash_value <> mysql.hash_value OR mssql.the_count <> mysql.the_count)
SELECT COUNT(*) AS missing_from_MSSQL FROM aws_hash_compare mysql WHERE the_filename = @mYsql_filename
AND NOT EXISTS (SELECT 1 FROM aws_hash_compare mssql WHERE the_filename = @mSsql_filename
AND mssql.switch_id = mysql.switch_id)
SELECT COUNT(*) AS missing_from_MYSQL FROM aws_hash_compare mssql WHERE the_filename = @mSsql_filename
AND NOT EXISTS (SELECT 1 FROM aws_hash_compare mysql WHERE the_filename = @mYsql_filename
AND mssql.switch_id = mysql.switch_id)
Upvotes: 1
Views: 166
Reputation: 247062
Here's a version that uses comm to compare the files, then awk to generate the results. It might be slower but may use less memory. comm requires that its input files be sorted.
I'm assuming that a key appears exactly once per file.
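To make the column layout concrete before reading the script, here is comm's raw output on the question's sample data (recreated below as filea and fileb). Columns are tab-separated: lines only in filea, lines only in fileb (indented one tab), and lines in both (indented two tabs), which is exactly what the $1/$2/$3 tests in the awk script key off.

```shell
# recreate the question's sample files (already in sorted order)
cat > filea <<'EOF'
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337313|cb1c00eeccb25ea5a069da63a1b0c2565379ff9c|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
3337320|1e3126f41f848509efad0b3415b003704377778c|1
EOF
cat > fileb <<'EOF'
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337315|780055f13efffcb4bee115c6cf546af85ac6c0a7|1
3337316|19535297b9913b6bca1796b68505498d5e81b5ed|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
EOF
# column 1: only in filea; column 2 (one tab in): only in fileb;
# column 3 (two tabs in): identical in both
comm filea fileb
```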
comm filea fileb | awk -F'\t' '
BEGIN { na = nb = identical = common = 0 }
$1 {                                # column 1: row appears only in file a
    split($1, f, /[|]/)
    if (f[1] in b) { common++; delete b[f[1]] }
    else { a[f[1]] }                # bare reference just creates the key
    na++
}
$2 {                                # column 2: row appears only in file b
    split($2, f, /[|]/)
    if (f[1] in a) { common++; delete a[f[1]] }
    else { b[f[1]] }
    nb++
}
$3 {                                # column 3: identical row in both files
    identical++
    na++
    nb++
}
END {
    printf "%d rows in file A\n", na
    printf "%d rows in file B\n", nb
    printf "%d rows are identical\n", identical
    printf "%d rows are different but share a key\n", common
    printf "%d rows in file A only\n", length(a)
    printf "%d rows in file B only\n", length(b)
}
'
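For concreteness, here is an end-to-end run of the same approach on the question's sample data (the file names filea and fileb are assumed). On that sample, three rows are byte-identical, no key appears in both files with different content, and each file has two rows of its own, so the expected counts are easy to check by eye:

```shell
# recreate the question's sample files (already in sorted order)
cat > filea <<'EOF'
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337313|cb1c00eeccb25ea5a069da63a1b0c2565379ff9c|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
3337320|1e3126f41f848509efad0b3415b003704377778c|1
EOF
cat > fileb <<'EOF'
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337315|780055f13efffcb4bee115c6cf546af85ac6c0a7|1
3337316|19535297b9913b6bca1796b68505498d5e81b5ed|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
EOF
comm filea fileb | awk -F'\t' '
$1 { split($1, f, /[|]/); if (f[1] in b) { common++; delete b[f[1]] } else { a[f[1]] }; na++ }
$2 { split($2, f, /[|]/); if (f[1] in a) { common++; delete a[f[1]] } else { b[f[1]] }; nb++ }
$3 { identical++; na++; nb++ }
END {
    printf "%d rows in file A\n", na
    printf "%d rows in file B\n", nb
    printf "%d rows are identical\n", identical
    printf "%d rows are different but share a key\n", common
    printf "%d rows in file A only\n", length(a)
    printf "%d rows in file B only\n", length(b)
}'
# prints:
# 5 rows in file A
# 5 rows in file B
# 3 rows are identical
# 0 rows are different but share a key
# 2 rows in file A only
# 2 rows in file B only
```

Note that length() on an array is supported by GNU awk and recent mawk, but is not strictly POSIX; on a strict POSIX awk you would count the remaining keys with a for-in loop instead.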
Upvotes: 1
Reputation: 133700
The following awk may help. Like the comm answer, it assumes each key appears at most once per file; it indexes file_A by key so that a row whose key matches but whose other fields differ is counted once, as "different", rather than landing in both "not in" buckets.
awk -F"|" '
BEGIN{
    common = first_field_matching = found_in_B_not_in_A = 0
}
FNR==NR{
    a[$1] = $0                    # index file_A rows by their key (field 1)
    next
}
FNR==1{
    file1_count = (NR-1) " rows in " ARGV[1]
}
($1 in a) && a[$1]==$0{           # whole row matches
    common++
    delete a[$1]
    next
}
($1 in a){                        # key matches but the 2nd/3rd field differ
    first_field_matching++
    delete a[$1]
    next
}
{                                 # key not present in file_A at all
    found_in_B_not_in_A++
}
END{
    found_in_A_not_in_B = length(a)
    print file1_count
    print FNR " rows in " ARGV[2]
    print common " rows are identical"
    print first_field_matching " rows are different (aka the 2nd/3rd field are different but the key matches)"
    print found_in_A_not_in_B " rows in File A not in File B"
    print found_in_B_not_in_A " rows in File B not in File A"
}
' file_A file_B
Say file_A and file_B are as follows (I have made a minor change to your sample input in order to exercise the case where $1 matches but the other fields do not).
cat file_A
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337313|cb1c00eeccb25ea5a069da63a1b0c2565379ff9c|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|786af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
3337320|1e3126f41f848509efad0b3415b003704377778c|1
cat file_B
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337315|780055f13efffcb4bee115c6cf546af85ac6c0a7|1
3337316|19535297b9913b6bca1796b68505498d5e81b5ed|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
Running the above code against those files produces:
5 rows in file_A
5 rows in file_B
2 rows are identical
1 rows are different (aka the 2nd/3rd field are different but the key matches)
2 rows in File A not in File B
2 rows in File B not in File A
Upvotes: 2