stats_noob
stats_noob

Reputation: 5907

SQL: Deleting Identical Columns With Different Names

My original table ("original_table") looks like this (contains both numeric and character variables):

  age height height2 gender gender2
1  18   76.1    76.1      M       M
2  19   77.0    77.0      F       F
3  20   78.1    78.1      M       M
4  21   78.2    78.2      M       M
5  22   78.8    78.8      F       F
6  23   79.7    79.7      F       F

I would like to remove columns from this table that have identical entries, but are named differently. In the end, this should look like this ("new_table"):

  age height gender
1  18   76.1      M
2  19   77.0      F
3  20   78.1      M
4  21   78.2      M
5  22   78.8      F
6  23   79.7      F

My Question: Is there a standard way to do this in SQL? I tried to do some research and came across the following link : How do I compare two columns for equality in SQL Server?

What I Tried So Far: It seems that something like this might work:

CREATE TABLE new_table AS SELECT * FROM original_table;

ALTER TABLE new_table
ADD does_age_equal_height varchar(255);

    UPDATE new_table 

SET     does_age_equal_height =  CASE  
WHEN age = height THEN '1' ELSE '0' END AS does_age_equal_height;
                  

From here, if the "sum" of all values in the "does_age_equal_height" column equals to the number of rows from "new_table" (i.e. select count(rownum) from new_table) - this must mean that both columns are equal, and that one of the columns can be dropped.

However, this is a very inefficient method, even for tables having a small number of columns. In my example, I have 5 columns - this means that I would have to repeat the above process " 5C2" times, i.e. 5! / (2!*3!) = 10 times. For example:

ALTER TABLE employees
  ADD does_age_equal_height varchar(255),
     does_age_equal_height2 varchar(255)
does_age_equal_gender varchar(255)
does_age_equal_gender2 varchar(255)
does_height_equal_height2 varchar(255)
does_height_equal_gender varchar(255)
does_height_equal_gender2 varchar(255)
does_height2_equal_gender varchar(255)
does_height2_equal_gender2 varchar(255)
does_gender_equal_gender2 varchar(255);

This would then be followed by multiple CASE statements - further complicating the process.

Can someone please show me a more efficient way of doing this?

Thanks!

Upvotes: 0

Views: 110

Answers (1)

Farshid Shekari
Farshid Shekari

Reputation: 2449

I hope to get your problem in the right way. This is my code in SqlServer to handle it, you should customize it based on Netezza SQL.

My idea is:
Calculate MD5 for each column and then compare these columns together, if there is the same hash, one of the columns will be chosen.

I going to create the below table for this problem:

CREATE TABLE Students
(
    Id INT PRIMARY KEY IDENTITY,
    StudentName VARCHAR (50),
    Course VARCHAR (50),
    Score INT,
    lastName VARCHAR (50) -- another alias for StudentName ,
    metric INT, -- another alias for score
    className VARCHAR(50) -- another alias for Course 
)
GO

INSERT INTO Students VALUES ('Sally', 'English', 95, 'Sally', 95, 'English');
INSERT INTO Students VALUES ('Sally', 'History', 82, 'Sally', 82, 'History');
INSERT INTO Students VALUES ('Edward', 'English', 45, 'Edward', 45, 'English');
INSERT INTO Students VALUES ('Edward', 'History', 78, 'Edward', 78, 'History');

after creating the table and inserting sample records, it turns to find similar columns.

step 1. Declare variables.

DECLARE @cols_q VARCHAR(max), 
        @cols VARCHAR(max), 
        @table_name  VARCHAR(max)= N'Students', 
        @res  NVARCHAR(max),
        @newCols VARCHAR(max),
        @finalResQuery VARCHAR(max);

step 2. Generate dynamics query for calculating a hash for every column.

SELECT @cols_q = COALESCE(@cols_q+ ', ','')+'HASHBYTES(''MD5'',  CONVERT(varbinary(max), (select '+ COLumn_NAME +' as t  from Students FOR XML AUTO))) as '+ COLumn_NAME,
        @cols = coalesce(@cols + ',','')+COLumn_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = @table_name; 
 set  @cols_q = 'select '+  @cols_q +' into ##tmp_'+ @table_name+' from '+ @table_name;

step 3. Run generated query.

 exec(@cols_q)

step 4. Get columns that duplicated columns removed.

  set @res = N'select  uniq_colname into ##temp_colnames
  from(
  select  max(colname) as uniq_colname from (
     select * from ##tmp_Students 
 )tt
 unpivot ( 
    md5_hash for colname in ( '+ @cols +')
 ) as tbl
 group by md5_hash
 )tr';
   exec ( @res);

step 5. Get final results

select @newCols  = COALESCE(@newCols+ ', ','')+ uniq_colname from ##temp_colnames

 set @finalResQuery = 'select '+ @newCols +' from '+ @table_name;
 exec (@finalResQuery)

Upvotes: 1

Related Questions