Reputation: 389
I have developed a Python script that reads a CSV file which is a result of a SQL query (just a select * from table
) and I perform some transformations and calculations on that dataframe.
I get the dataframe using the following Python commands:
result=csv_df.sort_values(by=['column1','column2','column3'],ascending=True)
result=result.drop_duplicates(['column1','column2'])
Now I need to get the same table using a SQL Query. I have tried the following in T-SQL but I have not been succesful.
select * from data
where column1 IN
(select distinct column1,column2 from data)
and
where column2 IN
(select distinct column1,column2 from data)
order by column1,column2;
I am new to SQL syntax, can someone help me with the query?
What I am trying to do is delete all the duplicated rows from the combination of column1
and column2
.
In Python the reason I include column3
is because it has NULL
values that I need to discard.
After this should I create a view to keep on performing calculations?
Upvotes: 3
Views: 739
Reputation: 107567
Assuming a unique ID in table, consider taking the record with lowest ID of matching column1 and column2 pairs:
SELECT * FROM data AS main
WHERE main.ID IN
(SELECT sub.MinID FROM
(SELECT column1, column2, Min(ID) As MinID
FROM data
GROUP BY column1, column2) AS sub)
ORDER BY main.column1, main.column2;
Alternatively, with JOIN
:
SELECT main.* FROM data AS main
INNER JOIN
(SELECT column1, column2, Min(ID) As MinID
FROM data
GROUP BY column1, column2) AS sub
ON main.ID = sub.MinID
ORDER BY main.column1, main.column2;
Even still, with EXISTS
:
SELECT main.* FROM data AS main
WHERE EXISTS
(SELECT 1 FROM
(SELECT column1, column2, Min(ID) As MinID
FROM data
GROUP BY column1, column2) sub
WHERE main.ID = sub.MinID)
ORDER BY main.column1, main.column2;
And a non-Window function query using correlated count subquery (for potential compatibility with MySQL, SQLite, and MS Access). This version leaves out records if any of two columns are NULL
:
SELECT * FROM
(SELECT *,
(SELECT Count(*) FROM data sub
WHERE sub.ID <= data.ID
AND sub.column1 = data.column1
AND sub.column2 = data.column2) AS rn
FROM data) AS main
WHERE main.rn = 1
Upvotes: 2
Reputation: 210822
If I understood your question correctly, you can do it using ROW_NUMBER() function:
with VirtTab as (
select
t.*,
row_number()
over(partition by column1, column2 order by column1, column2) as rn
from data t
)
select * from VirtTab
where rn = 1
order by column1, column2;
Upvotes: 0
Reputation: 195
From what I understand you require all records ordered by column1, column2 and column3:
Select * from data order by column1,column2,column3
Now, on top of this you want to remove duplicate rows in columns participantObjectId and slipObjectId. First partition columns on the basis of participantObjectId and slipObjectId. The query below wraps on top of the above query and adds another field id which provides unique values for each row of the data.
select *, ROW_NUMBER() OVER (PARTITION BY participantObjectId,slipObjectId order by column1,column2,column3) as id
from (select * from data order by column1,column2,column3)
On top of this we add another select statement with an added condition to choose only those rows with id equal to 1.
select * from
(select *, ROW_NUMBER() OVER (PARTITION BY participantObjectId,slipObjectId order by column1,column2,column3) as id from
(Select * from data order by column1,column2,column3
)) where id=1;
Upvotes: 0