For a group of ids pick out distinct ids in another column and count

Question

Im working on a project at my University and its really important I get this right. I think I have it working but I had to learn SQL in the last few days to do this and I want to post it here to double check with you all who know more about SQL than I do.

Here is a small example to test

index       BorrowerID  CompanyID   PackageID 
----------  ----------  ----------  ----------
0           1           100         10        
1           1           200         10        
2           2           300         20        
3           2           300         20        
4           2           400         20        
5           2           100         20        
6           3           400         30        
7           3           100         30        
8           3           200         30        
9           1           100         40

Goal:

We want to know the number of times a company loans to a borrower. In syndicated loans there may be multiple loans in one package, therefore those only count once per package. A company may loan multiple packages to the same borrower, these would need to increment the count. Ideally we would have a dataset that had the borrowers that had relationships with companies, defined as having received at least one loan from a company in a package. Here is what the result of the test should be:

BorrowerID  CompanyID   testResults
----------  ----------  -----------
1           100         2          
1           200         1          
2           100         1          
2           300         1          
2           400         1          
3           100         1          
3           200         1          
3           400         1

BorrowerID 1 borrowed from bank 100 two different times. Once in package 10 and then again in package 40. The result shows 2 as the count which is correct. All the others are correct since they only had one relationship with a bank.

This is the query that I have run to accomplish this:

.open testdb.db
.mode column
.headers on
select * from testdb;

drop table if exists innerQ;

create table innerQ
AS select PackageID, BorrowerID, CompanyID, count(*) as c 
from testdb 
group by PackageID, CompanyID
order by PackageID asc;

.print
.print 'inner query'
select * from innerQ;

.print
.print 'test'
select BorrowerID, CompanyID, count(*) as testResults
from innerQ
group by BorrowerID, CompanyID
order by BorrowerID asc;

I really want to know if I let this loose on the whole dataset, which is structured the same way as the test, will it do the right thing? I would like to know if it will let me down or not, since on my simple test, and multiple others that I have run it has correctly passed all tests, but some others in my research group have some different results which I am strongly suspecting are wrong but not 100% sure. I would appreciate if you all could weigh in.

CL. · Accepted Answer

The documentation says:

In any aggregate function that takes a single argument, that argument can be preceded by the keyword DISTINCT. In such cases, duplicate elements are filtered before being passed into the aggregate function. For example, the function "count(distinct X)" will return the number of distinct values of column X instead of the total number of non-null values in column X.

So you can do this directly with a single query:

SELECT BorrowerID,
       CompanyID,
       count(DISTINCT PackageID) AS testResults
FROM testdb
GROUP BY BorrowerID,
         CompanyID
ORDER BY BorrowerID,
         CompanyID;

For a group of ids pick out distinct ids in another column and count

Answers (1)

Related Questions