Reputation: 1965
I have a table with the following structure:
WorkerPersons
-------------------------------
ID (PK)
PersonID (Indicates which version of Person the record describes)
SomeColumn1 (data specific to Worker)
SomeColumn2 (data specific to Person)
....
SomeColumnN
-------------------------------
As you can see, it's a denormalized table, which holds both Worker and Person (and many versions of one Person) data in one table. My wish is to normalize that table, however, as the table holds a lot of data (many many columns), I need to be sure which columns should go to Workers table and which columns to Persons table. The outcome should be like this:
Workers Persons
----------------------- ---------------------
ID ID
PersonID (now a FK) PersonColumn1
WorkerColumn1 PersonColumn2
WorkerColumn2 ...
... PersonColumnN
WorkerColumnN
----------------------- ---------------------
To do that, I need to analyze which data differs in scope of Person over all unique Persons (wich are separated by PersonID in WorkerPersons). For example:
WorkerPersons
-------------------------------------------------------
ID PersonID Column1 Column2 Column3
-------------------------------------------------------
1 PersonA 10.1 John Doe Single
2 PersonA 10.1 John Doe Single
3 PersonA 10.1 John Doe Married
4 PersonB 09.2 Sully Single
5 PersonB 09.2 Sullivan Single
In this case, there are 3 versions on PersonA and 2 versions of PersonB. Column1 values are always the same over all versions of Person, and we can move that column to table Worker. But Column 2 and Column3 values change over different versions of Person, so those values should be moved to Person table.
No imagine, I have about 10 tables like this that need to be normalized, with about 40 columns in each. Eeach table holds about 500k to 5m rows.
I need a script that helps me analyse which columns to move where. I need a script that outputs all columns that change in scope of unique Person over the whole table. I've no ideas however how to do that. I experimented with LAG analytical function to compare against the next row but how in the world to output changed columns is beyond me.
Please advise.
Best wishes, Andrew
Upvotes: 3
Views: 2447
Reputation: 1965
Thanks, but I solved it by letting Excel create series of selects over table schema information. The final query that it generated was a long list of selects but it works (although it run over a hour). The "core query" (actually a formula in Excel to create to core query):
=IF(AND(C17<>"CLOB";C17<>"NCLOB");"SELECT '"&A17&".'||initcap('"&B17&"') description,
decode(count(*),0,'SAME OVE VERSIONS','DIFFERENT OVER VERSIONS') values FROM (SELECT
objektid, count(DISTINCT nvl("&B17&","&IF(C17="DATE";"'01.02.0004'";IF(C17="VARCHAR2"
;"'!#¤¤%¤(%#¤%AS'";"-1234561"))&")) OVER (PARTITION BY objectid) arv FROM "&A17&")
WHERE number > 1 union all";"SELECT '"&A17&".'||initcap('"&B17&"') description, 'CLOB
field' values from dual union all")
Upvotes: 1
Reputation: 12704
Since 10 tables is not a lot, here is (some sort of) pseudo code
for each table_name in tables
for each column_name in columns
case (exists (select 1
from table_name
group by PersonID
having min(column_name) = max(column_name))
when true then 'Worker'
when false then 'Person'
end case
end for
end for
with information schema and dynamic queries you could make the above proper PL/SQL or take the core query and script it in your favourite language.
EDIT:
The above assumes no NULL
s in column_name
.
EDIT2: Other variants of the core query can be
SELECT 1
FROM
(SELECT COUNT(DISTINCT column_name) AS distinct_values_by_pid
FROM table_name
GROUP BY PersonID) T
HAVING MIN(distinct_values_by_pid) = MAX(distinct_values_by_pid)
Which will return a row if all values per PersonID are the same. (this query also has problems with NULLS, but I consider NULLs a separate issue; you can always cast a NULL to some out-of-domain value for purposes of the above query)
The above query can be also written as
SELECT MIN(c1)=MAX(c1), MIN(c2)=MAX(c2), ...
FROM
(SELECT COUNT(DISTINCT column_name_1) AS c1, COUNT(DISTINCT column_name_2) AS c2, ...
FROM table_name
GROUP BY PersonID) T
Which will test multiple columns at the same time returning true for columns that belong to 'Workers' and false for columns that should go into 'Persons'.
Upvotes: 3