Mehdi
Mehdi

Reputation: 37

Using the appropriate SELECT statement to extract data from multiple tables

I'm trying to extract some data using SQL queries from different tables I have already created. The tables are as follows:

CREATE TABLE Candidates 
(
    Candidate_ID INTEGER PRIMARY KEY,
    Start_Year DATE,
    Course_Current_Status VARCHAR(18),  
    First_Name TEXT,
    Last_Name TEXT,
    Gender TEXT);



CREATE TABLE Subjects 
(
    Subject_Code INTEGER PRIMARY KEY,
    Subject_Name TEXT, 
    Subject_Credits INTEGER,
    Subject_Level INTEGER,
    TeacherID INTEGER REFERENCES Teachers(TeacherID)
);

CREATE TABLE Subjects-taken 
(
    Marks_Obtained INTEGER,
    Subject_Code INTEGER REFERENCES Subjects(Subject_Code),
    Candidate_ID INTEGER REFERENCES Students(Candidate_ID),
    Program_Year_When_Enrolled TEXT, 
    PRIMARY KEY(Subject_Code, Candidate_ID)
);

I want to show three columns with my query:

Candidate_ID, Total_first_Year_Marks, Total_fourth_Year_Marks, Overall_Marks 

For fourth year marks the Enrollment.Program_Year_When_Enrolled = 'Second' And for the fourth year marks the Enrollment.Program_Year_When_Enrolled = 'Third'. For the overall marks a new column would have to be created by the query i.e. Total Marks which would be 1/3 of the second year marks and 2/3 of the third year marks.

SELECT Candidates.Candidate_ID, AVG(Enrollment.Marks_obtained) AS avg_marks FROM Candidates LEFT JOIN Enrollment ON Candidates.Candidate_ID = Enrollment.Candidate_ID WHERE Enrollment.Program_Year_When_Enrolled = 'Third' AND Candidates.Course_Current_Status = 'Graduated' GROUP BY Candidates.Candidate_ID

Can someone please tell me what I'm doing wrong. Much appreciated. Thanks.

Upvotes: 1

Views: 412

Answers (3)

Thorsten Kettner
Thorsten Kettner

Reputation: 95033

You want to join students with their aggregated enrollment values. As 2017 is in the past, we don't need an outer join here. Each student will have got marks in the two years, otherwise - how could they have graduated? At least this is how I understand this.

One thing to be aware of when dividing is that PostgreSQL applies what's called an integer division, where decimals get swallowed, e.g. 3 / 2 = 1. One way to circumvent this to multiply with 1.0 for instance.

select
  s.student_id, 
  e.average_second_year_marks, 
  e.average_third_year_marks, 
  e.overall_marks
from students s
join
(
  select student_id,
    avg(case when program_year_when_enrolled = 'Second' then marks_obtained end)
      as average_second_year_marks,
    avg(case when program_year_when_enrolled = 'Third' then marks_obtained end)
      as average_third_year_marks,
    (
     (sum(case when program_year_when_enrolled = 'Second' then marks_obtained end) * 1.0) +
     (sum(case when program_year_when_enrolled = 'Third' then marks_obtained end) * 2.0)
    ) / 3.0 as overall_marks
  from enrollment
  group by student_id
) e on e.student_id = s.student_id
where s.course_current_status = 'Graduated-2017';

Upvotes: 0

PerformanceDBA
PerformanceDBA

Reputation: 33808

Problem

You need to learn and understand what an ordinary SQL Subquery is.

  • There are two types:
    • Scalar Subquery (Not Correlated)
    • Correlated Subquery
  • A Subquery can be used anywhere that an expression can be used.
  • Both are basic, essential facilities in ANSI SQL that we have had since SQL ANSI 1987.

  • Here is the current entry for Correlated Subquery:

ISO/IEC 9075-2:2008(E), page 1262 (Annex F):

45 E061-13 Correlated subqueries
— Subclause 8.1, /predicate/:
When a /correlation name/ can be used in a /table subquery/
as a correlated reference to a column in the outer query

Consideration

The type required for this is Correlated.

Imagine an outer grid (as in a spreadsheet). The rows are Student_ID, the columns are the three you requested. A normal or single SELECT means that the columns have the exact same relation to the row identifier (Student_ID). You were trying was getting all of it in a single SELECT, without realising that each column in the spreadsheet has a discrete and different relation to the row identifier.

Further, you were complicating the matter (confusing yourself) by using GROUP BY, which of course affects the whole SELECT (since you had just one): the GROUP BY is required for the AVG() but not the outer query.

Solution

First, you need a Subquery, because the columns do not relate to each other (they relate to the row identifier).

Second, you need a Correlated Subquery, which correlates the row identifier in the outer grid (Student_ID), which is the outer query, to the content of the column, which is the inner query.

  • This is a typical mistake that developers make, fiddling around with the one SELECT, trying to "make it work", because they do not realise the Problem is in their understanding and approach, that a Subquery is called for.

Code

SELECT  -- Student_ID,  -- humans don"t use numbers for identifiers
        Last_Name,
        First_Name,
        Course_Code,
        (SELECT  AVG( Marks_Obtained ) -- CS.A
            FROM Enrollment 
            WHERE Student_ID = S_OUTER.Student_ID
                AND Module_Year = "Second" 
            GROUP BY Enrollment.Student_ID
            ) AS Avg_2nd_Year_Marks,
        (SELECT  AVG( Marks_Obtained ) -- CS.B
            FROM Enrollment 
            WHERE Student_ID = S_OUTER.Student_ID
                AND Module_Year = "Third" 
            GROUP BY Enrollment.Student_ID
            ) AS Avg_3rd_Year_Marks,
        (SELECT  AVG( Marks_Obtained ) -- CS.A / 3 * 1
            FROM Enrollment 
            WHERE Student_ID = S_OUTER.Student_ID
                AND Module_Year = "Second" 
            GROUP BY Enrollment.Student_ID
            ) / 3 + 
        (SELECT  AVG( Marks_Obtained ) -- CS.B / 3 * 2
            FROM Enrollment 
            WHERE Student_ID = S_OUTER.Student_ID
                AND Module_Year = "Third" 
            GROUP BY Enrollment.Student_ID
             ) / 3 * 2 AS Overall_Marks
    FROM Students S_OUTER 
    WHERE Graduate_Year = "2017"  -- get current year using relevant date function
    ORDER BY Last_Name, First_Name

Explanation & Response to Questions

  1. CS means Correlated Subquery. The correlation between the outer query (alias S_OUTER), and the inner query, is via the alias.

  2. The product of the Subquery is a virtual (not real) column, which exists for the duration of the command execution. It is computed or generated (as distinct from created) by the SELECT. Such a column (non-existent in the database, generated on the fly) needs a column name, such as the following. It will appear in the result set, in your client program:

    • Avg_2nd_Year_Marks
    • Avg_3rd_Year_Marks
    • Overall_Marks
  3. The third column has to repeat the two subqueries because it has to perform arithmetic.

  4. The code uses standard ANSI SQL facilities (such as a Subquery) and syntax, it will work on any SQL-compliant platform (no need for CTEs, etc.)

    • Oracle cacks itself on subqueries (it is not SQL-complaint).
    • No idea whether Freeware such as postGresNONsql provides ordinary SQL Subqueries. It is well-known to be non-SQL-compliant, contrary to the marketing and the "SQL compliance" notice in every page in the manual. Massive fraud. The fact that most seekers on this site have not seen a Subquery, is an indicator that the freeware does not have it.
    • If it does not have the basic SQL Subquery facility, there is a way of using torturing the subquery code into Inline Views, and then hammering it with GROUP BY, in order to obtain a similar result. Very inefficient, but hey, the "platform" isn't a platform, it is by no means a server, it is already very inefficient and slow.

Should I create a column in the table with the name Avg_3rd_Year_Marks

Definitely not.

or is there any other code which could solve this matter out

Yes. A rather simple SQL Subquery, as prescribed from the begininng of this answer.

  • At this point we have found that your NON-sql NON-platform does not recognise = for assignment of acolumn name, computed or real. I have changed the syntax to use AS (both are standard SQL syntax).

  • If that does not work it means the freeware is [on yet another count] not SQL-complaint, not even for the SQL Basics.

would this query be possible if I use module_code and module_year as a primary composite key

  • It makes no difference whatsoever to the outer query or the subqueries, because it is not referenced (the average is across all Module_Years, all Module_Codes, for the Student_ID). Just the column name has changed.

  • It does make a difference in the Data Model, and for understanding. If that PK is the correct PK for Module, yes. That is correct IFF a given ModuleCode can be taken in more than one Module_Year.

  • We do not fix up the data to fit the requirement. We fix up the data to fit Reality. And then the code required is dead simple. Further, that eliminates ongoing changes to the data structures, because the structure of Reality does not change (the content does).

Changed course_current_year to course_current_status for it to make more sense.

What I want to do is extract data for students graduating in 2017 i.e. the Course_Current_Status = 'Graduated-2017'.

  • That does not make sense. You certainly do not want to hard-code = "2017" in the query, you want the current year. So get that from whatever function postGresNONsql has for getting the current date, and grab just the year.

  • It appears that you need the Graduation Year in Student. Not "current". In the query you want the Students that are graduating in the current year. I have changed the Data Model and code accordingly.

Note

In order to make sense of your question, I had to format your text, and erect a data model. If you are interested: Hadi Data Model. Data Model updated per changes and comments.

By the way, the naming in your tables is better than most, but still a bit crazy.

Upvotes: 2

Dr Phil
Dr Phil

Reputation: 880

Since you didn't specify the exact database it is hard to use conditional expressions for the sum. For example, MySQL has if function, SQL Server has select/case/when. Depending on the DB engine the syntax could be quite different.

If we use the syntax of subqueries we can go like this:

with student_year_grade(Student_ID,avg_marks,Course_Current_Year)
as (
SELECT 
    Students.Student_ID,
    AVG(Enrollment.Marks_obtained) AS avg_marks ,
    Course_Current_Year
FROM
    Students LEFT JOIN
    Enrollment ON Students.Student_ID=Enrollment.Student_ID 
WHERE 
    Enrollment.Program_Year_When_Enrolled = 'Third' 
GROUP BY 
    Students.Student_ID, Course_Current_Year
)
select Student_ID, 
   sum( if(Course_Current_Year='G',avg_marks,0) as current_mark,
   sum( if(Course_Current_Year='-1',avg_marks,0) as prev_mark,
   sum( if(Course_Current_Year='-2',avg_marks,0) as prev_mark

group by student ID 
from student_year_grade

you will have to replace -1 with the code for the previous year, and -2 with the one for the year before. and probably multiply by your 1/3 or 2/3 factor

Upvotes: 0

Related Questions