Reputation: 389

Python Dataframe to SQL Query

I have developed a Python script that reads a CSV file which is a result of a SQL query (just a select * from table) and I perform some transformations and calculations on that dataframe.

I get the dataframe using the following Python commands:

result=csv_df.sort_values(by=['column1','column2','column3'],ascending=True)
result=result.drop_duplicates(['column1','column2'])

Now I need to get the same table using a SQL Query. I have tried the following in T-SQL but I have not been succesful.

select * from data
    where column1 IN
    (select distinct column1,column2 from data)
 and 
    where column2 IN
    (select distinct column1,column2 from data)
    order by column1,column2;

I am new to SQL syntax, can someone help me with the query?

What I am trying to do is delete all the duplicated rows from the combination of column1 and column2.

In Python the reason I include column3 is because it has NULL values that I need to discard.

After this should I create a view to keep on performing calculations?

Upvotes: 3

Answers (3)

Parfait

Reputation: 107567

Assuming a unique ID in table, consider taking the record with lowest ID of matching column1 and column2 pairs:

SELECT * FROM data AS main
WHERE main.ID IN
    (SELECT sub.MinID FROM
       (SELECT column1, column2, Min(ID) As MinID
        FROM data
        GROUP BY column1, column2) AS sub)
ORDER BY main.column1, main.column2;

Alternatively, with JOIN:

SELECT main.* FROM data AS main
INNER JOIN 
    (SELECT column1, column2, Min(ID) As MinID
     FROM data
     GROUP BY column1, column2) AS sub
ON main.ID = sub.MinID
ORDER BY main.column1, main.column2;

Even still, with EXISTS:

SELECT main.* FROM data AS main
WHERE EXISTS
  (SELECT 1 FROM 
      (SELECT column1, column2, Min(ID) As MinID
       FROM data
       GROUP BY column1, column2) sub
   WHERE main.ID = sub.MinID)
ORDER BY main.column1, main.column2;

And a non-Window function query using correlated count subquery (for potential compatibility with MySQL, SQLite, and MS Access). This version leaves out records if any of two columns are NULL:

SELECT * FROM 
  (SELECT *, 
      (SELECT Count(*) FROM data sub
       WHERE sub.ID <= data.ID
       AND sub.column1 = data.column1
       AND sub.column2 = data.column2) AS rn
   FROM data) AS main
WHERE main.rn = 1

Upvotes: 2

MaxU - stand with Ukraine

Reputation: 210822

If I understood your question correctly, you can do it using ROW_NUMBER() function:

with VirtTab as (
    select
      t.*,
      row_number()
        over(partition by column1, column2 order by column1, column2) as rn
    from data t
)
select * from VirtTab
where rn = 1
order by column1, column2;

Upvotes: 0

Abhishek Anand

Reputation: 195

From what I understand you require all records ordered by column1, column2 and column3:

Select * from data order by column1,column2,column3

Now, on top of this you want to remove duplicate rows in columns participantObjectId and slipObjectId. First partition columns on the basis of participantObjectId and slipObjectId. The query below wraps on top of the above query and adds another field id which provides unique values for each row of the data.

select *, ROW_NUMBER() OVER (PARTITION BY participantObjectId,slipObjectId order by column1,column2,column3) as id
from (select * from data order by column1,column2,column3)

On top of this we add another select statement with an added condition to choose only those rows with id equal to 1.

select * from
(select *, ROW_NUMBER() OVER (PARTITION BY participantObjectId,slipObjectId order by column1,column2,column3) as id from 
(Select * from data order by column1,column2,column3
)) where id=1;

Upvotes: 0

Python Dataframe to SQL Query

Answers (3)

Related Questions