Reputation: 5907
My original table ("original_table") looks like this (contains both numeric and character variables):
age height height2 gender gender2
1 18 76.1 76.1 M M
2 19 77.0 77.0 F F
3 20 78.1 78.1 M M
4 21 78.2 78.2 M M
5 22 78.8 78.8 F F
6 23 79.7 79.7 F F
I would like to remove columns from this table that have identical entries, but are named differently. In the end, this should look like this ("new_table"):
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
My Question: Is there a standard way to do this in SQL? I tried to do some research and came across the following link : How do I compare two columns for equality in SQL Server?
What I Tried So Far: It seems that something like this might work:
CREATE TABLE new_table AS SELECT * FROM original_table;
ALTER TABLE new_table
ADD does_age_equal_height varchar(255);
UPDATE new_table
SET does_age_equal_height = CASE
WHEN age = height THEN '1' ELSE '0' END AS does_age_equal_height;
From here, if the "sum" of all values in the "does_age_equal_height" column equals to the number of rows from "new_table" (i.e. select count(rownum) from new_table
) - this must mean that both columns are equal, and that one of the columns can be dropped.
However, this is a very inefficient method, even for tables having a small number of columns. In my example, I have 5 columns - this means that I would have to repeat the above process " 5C2" times, i.e. 5! / (2!*3!) = 10 times. For example:
ALTER TABLE employees
ADD does_age_equal_height varchar(255),
does_age_equal_height2 varchar(255)
does_age_equal_gender varchar(255)
does_age_equal_gender2 varchar(255)
does_height_equal_height2 varchar(255)
does_height_equal_gender varchar(255)
does_height_equal_gender2 varchar(255)
does_height2_equal_gender varchar(255)
does_height2_equal_gender2 varchar(255)
does_gender_equal_gender2 varchar(255);
This would then be followed by multiple CASE statements - further complicating the process.
Can someone please show me a more efficient way of doing this?
Thanks!
Upvotes: 0
Views: 110
Reputation: 2449
I hope to get your problem in the right way. This is my code in SqlServer to handle it, you should customize it based on Netezza SQL.
My idea is:
Calculate MD5 for each column and then compare these columns together, if there is the same hash, one of the columns will be chosen.
I going to create the below table for this problem:
CREATE TABLE Students
(
Id INT PRIMARY KEY IDENTITY,
StudentName VARCHAR (50),
Course VARCHAR (50),
Score INT,
lastName VARCHAR (50) -- another alias for StudentName ,
metric INT, -- another alias for score
className VARCHAR(50) -- another alias for Course
)
GO
INSERT INTO Students VALUES ('Sally', 'English', 95, 'Sally', 95, 'English');
INSERT INTO Students VALUES ('Sally', 'History', 82, 'Sally', 82, 'History');
INSERT INTO Students VALUES ('Edward', 'English', 45, 'Edward', 45, 'English');
INSERT INTO Students VALUES ('Edward', 'History', 78, 'Edward', 78, 'History');
after creating the table and inserting sample records, it turns to find similar columns.
step 1. Declare variables.
DECLARE @cols_q VARCHAR(max),
@cols VARCHAR(max),
@table_name VARCHAR(max)= N'Students',
@res NVARCHAR(max),
@newCols VARCHAR(max),
@finalResQuery VARCHAR(max);
step 2. Generate dynamics query for calculating a hash for every column.
SELECT @cols_q = COALESCE(@cols_q+ ', ','')+'HASHBYTES(''MD5'', CONVERT(varbinary(max), (select '+ COLumn_NAME +' as t from Students FOR XML AUTO))) as '+ COLumn_NAME,
@cols = coalesce(@cols + ',','')+COLumn_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = @table_name;
set @cols_q = 'select '+ @cols_q +' into ##tmp_'+ @table_name+' from '+ @table_name;
step 3. Run generated query.
exec(@cols_q)
step 4. Get columns that duplicated columns removed.
set @res = N'select uniq_colname into ##temp_colnames
from(
select max(colname) as uniq_colname from (
select * from ##tmp_Students
)tt
unpivot (
md5_hash for colname in ( '+ @cols +')
) as tbl
group by md5_hash
)tr';
exec ( @res);
step 5. Get final results
select @newCols = COALESCE(@newCols+ ', ','')+ uniq_colname from ##temp_colnames
set @finalResQuery = 'select '+ @newCols +' from '+ @table_name;
exec (@finalResQuery)
Upvotes: 1