Reputation: 337

SQL groupby having count distinct

I've got a PostgreSQL database that contains a table with IP, user, and time fields. I need a query that gives me the complete set of IPs that have only a single user active on them over a defined time period (i.e. I need to filter out IPs with multiple or no users, and I should end up with only one row per IP). The user field contains some null values, which I can filter out. I'm using pandas' read_sql() method to get a dataframe directly.

I can get the full dataframe of data from the defined time period easily with:

SELECT ip, user FROM table WHERE user IS NOT NULL AND time >= start AND time <= end

I can then take this data and wrangle the information I need out of it easily using pandas with groupby and filter operations. However, I would like to be able to get what I need using a single SQL query. Unfortunately, my SQL chops ain't too hot. My first attempt below isn't great; the dataframe I end up with isn't the same as when I create the dataframe manually using the original query above and some pandas wrangling.

SELECT DISTINCT ip, user FROM table WHERE user IS NOT NULL AND ip IN (SELECT ip FROM table WHERE user IS NOT NULL AND time >= start AND time <= end GROUP BY ip HAVING COUNT(DISTINCT user) = 1)

Can anyone point me in the right direction here? Thanks.

edit: I neglected to mention that there are multiple entries for each user/ip combination. The source is network authentication traffic, and users authenticate on IPs very frequently.

Sample table head:

---------------------------------
ip         | user    | time
---------------------------------
172.18.0.0 | jbloggs | 1531987000
172.18.0.0 | jbloggs | 1531987100
172.18.0.1 | jsmith  | 1531987200
172.18.0.1 | jbloggs | 1531987300
172.18.0.2 | odin    | 1531987400

If I were to query this example table for the time range 1531987000 to 1531987400 I would like the following output:

---------------------
ip         | user
---------------------
172.18.0.0 | jbloggs
172.18.0.2 | odin

Upvotes: 0

Answers (3)

MikeFenton

Reputation: 337

I have figured out a query that gets me what I want:

SELECT DISTINCT ip, user
FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end
  AND ip IN (
    SELECT ip
    FROM table
    WHERE user IS NOT NULL AND time >= start AND time <= end
    GROUP BY ip
    HAVING COUNT(DISTINCT user) = 1
  )

Explanation:

The inner select gets me all IPs that have only one user across the specified time range. I then need to select the distinct ip/user pairs from the main table where the IPs are in the nested select.

It seems messy that I have to apply the same filtering (time range and non-null user) twice, though. Is there a better way to do this?
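One possible way to avoid the repeated filter, sketched here rather than tested on the real data: since every surviving group has exactly one distinct user, MIN(user) can return that user directly from the grouped query, so no outer select is needed. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs on its own; the table name auth_log is made up for illustration, and the column is written as "user" because an unquoted user is a reserved word in PostgreSQL.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; auth_log is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE auth_log (ip TEXT, "user" TEXT, time INTEGER)')
conn.executemany(
    "INSERT INTO auth_log VALUES (?, ?, ?)",
    [
        ("172.18.0.0", "jbloggs", 1531987000),
        ("172.18.0.0", "jbloggs", 1531987100),
        ("172.18.0.1", "jsmith", 1531987200),
        ("172.18.0.1", "jbloggs", 1531987300),
        ("172.18.0.2", "odin", 1531987400),
    ],
)

# Filter once, group by ip, keep groups with exactly one distinct user.
# MIN("user") is simply "the" user of the group, because there is only one.
single_pass = """
    SELECT ip, MIN("user") AS "user"
    FROM auth_log
    WHERE "user" IS NOT NULL AND time >= ? AND time <= ?
    GROUP BY ip
    HAVING COUNT(DISTINCT "user") = 1
    ORDER BY ip
"""
rows = conn.execute(single_pass, (1531987000, 1531987400)).fetchall()
print(rows)  # [('172.18.0.0', 'jbloggs'), ('172.18.0.2', 'odin')]
```

On the sample table this gives the two rows I was after. With pandas, the same query string and parameters could presumably be passed to read_sql() against the real connection.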

Upvotes: 0

Gordon Linoff

Reputation: 1269793

If by "single user" you mean that an IP may have multiple rows but only one distinct user across them, then:

SELECT ip
FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end
GROUP BY ip
HAVING MIN(user) = MAX(user) AND COUNT(user) = COUNT(*);
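As a rough check of the MIN/MAX idea against the question's sample data, here is a small sketch using Python's built-in sqlite3 in place of PostgreSQL. The table name auth_log, the parameter names t_start and t_end, and the extra NULL row are assumptions added for illustration. It also adds MIN("user") to the select list, since the question asks for the user as well as the IP. Because the WHERE clause already drops NULL users, the COUNT(user) = COUNT(*) condition is always true in this form and is left out of the sketch.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; auth_log is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE auth_log (ip TEXT, "user" TEXT, time INTEGER)')
conn.executemany(
    "INSERT INTO auth_log VALUES (?, ?, ?)",
    [
        ("172.18.0.0", "jbloggs", 1531987000),
        ("172.18.0.0", "jbloggs", 1531987100),
        ("172.18.0.1", "jsmith", 1531987200),
        ("172.18.0.1", "jbloggs", 1531987300),
        ("172.18.0.2", None, 1531987350),  # extra NULL-user row, not in the question
        ("172.18.0.2", "odin", 1531987400),
    ],
)

# If the smallest and largest user in a group are equal, the group has one user.
query = """
    SELECT ip, MIN("user") AS "user"
    FROM auth_log
    WHERE "user" IS NOT NULL AND time >= :t_start AND time <= :t_end
    GROUP BY ip
    HAVING MIN("user") = MAX("user")
    ORDER BY ip
"""
result = conn.execute(query, {"t_start": 1531987000, "t_end": 1531987400}).fetchall()
print(result)  # [('172.18.0.0', 'jbloggs'), ('172.18.0.2', 'odin')]
```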

Upvotes: 0

Shiva

Reputation: 2838

This should work:

SELECT ip
FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end
GROUP BY ip
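HAVING COUNT(DISTINCT user) = 1

Explanation:
SELECT ip FROM table WHERE user IS NOT NULL AND time >= start AND time <= end - filters out the null users and restricts the rows to the time period.

...GROUP BY ip HAVING COUNT(DISTINCT user) = 1 - keeps only the IPs with exactly one distinct user. COUNT(ip) would count rows rather than users, so an IP where the same user authenticated several times would be dropped as well.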
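To see how the per-IP row count and the per-IP distinct-user count relate on the question's sample table, here is a small sketch. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table name auth_log is an assumption made for the example.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; auth_log is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE auth_log (ip TEXT, "user" TEXT, time INTEGER)')
conn.executemany(
    "INSERT INTO auth_log VALUES (?, ?, ?)",
    [
        ("172.18.0.0", "jbloggs", 1531987000),
        ("172.18.0.0", "jbloggs", 1531987100),
        ("172.18.0.1", "jsmith", 1531987200),
        ("172.18.0.1", "jbloggs", 1531987300),
        ("172.18.0.2", "odin", 1531987400),
    ],
)

# Report both counts side by side for each ip in the time range.
per_ip = conn.execute(
    """
    SELECT ip, COUNT(ip) AS n_rows, COUNT(DISTINCT "user") AS n_users
    FROM auth_log
    WHERE "user" IS NOT NULL AND time >= ? AND time <= ?
    GROUP BY ip
    ORDER BY ip
    """,
    (1531987000, 1531987400),
).fetchall()

for ip, n_rows, n_users in per_ip:
    print(ip, n_rows, n_users)
# 172.18.0.0 2 1
# 172.18.0.1 2 2
# 172.18.0.2 1 1
```

A filter on n_rows = 1 would keep only 172.18.0.2, whereas a filter on n_users = 1 keeps both 172.18.0.0 and 172.18.0.2, which matches the output requested in the question.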

Upvotes: 1
