Reputation: 844

Natural Join -- Relational theory and SQL

This question comes from my readings of C.J Date's SQL and Relational Theory: How to Write Accurate SQL Code and looking up about joins on the internet (which includes coming across multiple posts here on NATURAL JOINs (and about SQL Server's lack of support for it))

So here is my problem...

On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred).

On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction).

Is the reconciliation of these that:

Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS.

and / or

Good/Better table design should remove/minimise the problems that natural join creates.

Upvotes: 2

Answers (4)

onedaywhen

Reputation: 57023

First, the choice between theory and being practical is a fallacy. To quote Chris Date: "the truth is that theory--at least the theory I'm talking about here, which is relational theory--is most definitely very practical indeed".

Second, consider that natural join relies on attribute naming. Please (re)read the following sections of the Accurate SQL Code book:

6.12. The Reliance on Attribute Names. Salient quote:

The operators of the relational algebra… all rely heavily on attribute naming.

3.9. Column Naming in SQL. Salient quote:

Strong recommendation: …if two columns in SQL represent "the same kind of information," give them the same name wherever possible. (That's why, for example, the two supplier number columns in the suppliers-and-parts database are both called SNO and not, say, SNO in one table and SNUM in the other.) Conversely, if two columns represent different kinds of information, it's usually a good idea to give them different names.

I'd like to address @kuru kuru pa's point (a good one too) about columns being added to a table over which you have no control, such as a "web service you're consuming." It seems to me that this problem is effectively mitigated using the strategy suggested by Date in section 3.9 (referenced above): quote:

For every base table, define a view identical to that base table except possibly for some column renaming.
Make sure the set of views so defined abides by the column naming discipline described above.
Operate in terms of those views instead of the underlying base tables.

Personally, I find the "natural join considered dangerous" attitude frustrating. Not wishing to sound self-righteous but my own naming convention, which follows the guidance of ISO 11179-5 Naming and identification principles, results in schema highly suited to natural join.

Sadly, natural join perhaps won't be supported anytime soon in the DBMS product I use professionally (SQL Server): the relevant feature request on Microsoft Connect ~~is currently closed as "won't fix" despite currently having a respectable +38 / -2 score~~ has been reopened and gained a respectable 46 / -2 score (go vote for it now :)

Upvotes: 6

nvogel

Reputation: 25524

The main problem with the NATURAL JOIN syntax in SQL is that it is typically too verbose.

In Tutorial D syntax I can very simply write a natural join as:

R{a,b,c} JOIN S{a,c,d};

But in SQL the SELECT statement needs either derived table subqueries or a WHERE clause and aliases to achieve the same thing. That's because a single "SELECT statement" is really a non-relational, compound operator in which the component operations always happen in a predetermined order. Projection comes after joins and columns in the result of a join don't necessarily have unique names.

E.g. the above query can be written in SQL as:

SELECT DISTINCT a, b, c, d
FROM
(SELECT a,b,c FROM R) R
NATURAL JOIN
(SELECT a,c,d FROM S) S;

or:

SELECT DISTINCT R.a, R.b, R.c, S.d
FROM R,S
WHERE R.a = S.a AND R.c = S.c;

People will likely prefer the latter version because it is shorter and "simpler".

Upvotes: 2

Erwin Smout

Reputation: 18408

a number of points regarding your question (even if I'm afraid I'm not really answering anything you asked),

"On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred)."

This seems to suggest that you interpret theory as if it proscribes against "other kinds" of joins ... That is not really true. Relational theory does not say "you cannot have antijoins", or "you should never use antijoins", or anything like that. What it DOES say, is that in the relational algebra, a set of primitive operators can be identified, in which natural join is the only "join-like" operator. All other "join-like" operators, can always be expressed equivalently in terms of the primitive operators defined. Cartesian product, for example, is a special case of a natural join (where the set of common attributes is empty), and if you want the cartesian product of two tables that do have an attribute name in common, you can address this using RENAME. Semijoin, for example, is the natural join of the first table with some projection on the second. Antijoin, for example (SEMIMINUS or NOT MATCHING in Date's book), is the relational difference between the first table and a SEMIJOIN of the two. etc. etc.

"On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction)."

Where are such things advised ? In the SQL standard ? I don't really think so. It is important to distinguish between the SQL language per se, which is defined by an ISO standard, and some (/any) particular implementation of that language, which is built by some particular vendor. If Microsoft advises its customers to not use NJ in SQL Server 200x, then that advice has a completely different meaning than an advice by someone to not ever use NJ in SQL altogether.

"Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS."

While it is true that SQL per se fails to faithfully comply with relational theory, that actually has very little to do with the question of NJ.

Whether an implementation gives good performance for invocations of NJ, is a characteristic of that implementation, not of the language, or of the "degree of trueness" of the 'R' in 'RDBMS'. It is very easy to build a TRDBMS that doesn't use SQL, and that gives ridiculous execution times for NJ. The SQL language per se has everything that is needed to support NJ. If an implementation supports NJ, then NJ will work in that implementation too. Whether it gives good performance, is a characteristic of that implementation, and poor performance of some particular implementation should not be "extrapolated" to other implementations, or be seen as a characteristic of the SQL language per se.

"Good/Better table design should remove/minimise the problems that natural join creates."

Problems that natural join creates ? Controlling the columns that appear in the arguments to a join is easily done by adding explicit projections (and renames if needed) on the columns you want. Much like you also want to avoid SELECT * as much as possible, for basically the same reason ...

Upvotes: 6

Chains

Reputation: 13157

Theory versus reality...

Natural joins are not practical.
There is no such thing as a pure (i.e. practice is idetical to theory) RDBMS, as far as I know.

I think Oracle and a few others actually support support natural joins -- TSQL doesn't.

Consider the world we live in -- chances of two tables each having a column with the same name is pretty high (like maybe [name] or [id] or [date], etc.). Maybe those chances are narrowed down a bit by grouping only those tables you might actually want to join. But regardless, without a careful examination of the table structure, you won't know if a "natural join" is a good idea or not. And even if it is, at that moment, it might not be in another year when the application gets an upgrade which adds columns to certain tables, etc., or the web service you're consuming adds fields you didn't know about, etc.

I think a "pure" system would have to be one you had 100% control over at a minimum, and then also, one that would have some good validation in the alter table / create table process that would warn / prevent you from creating a new column in some table that could be "naturally" joined to some other table you might not be intending it to be join-able to.

I guess bottom-line for me would be, valuing my sanity, wanting my applications to have maximum up-time, valuing quick/clean maintenance and upgrades, etc. -- good table design in this context means not using natural joins (ever).

Upvotes: -2

Natural Join -- Relational theory and SQL

Answers (4)

Related Questions