user548084
user548084

Reputation: 499

Best structure for tables with more than 10000 columns

I am applying a group of data mining algorithms to a dataset comprised of a set of customers along with a large number of descriptive attributes that summarize various aspects of their past behavior. There are more than 10,000 attributes, each stored as a column in a table with the customer id as the primary key. For several reasons, it is necessary to pre-compute these attributes rather than calculating them on the fly. I generally try to select customer with a specified attribute set. The algorithms can combine any arbitrary number of these attributes together in a single SELECT statement and join the required tables. All the tables have the same number of rows (one per customer).

I am wondering what's the best way to structure these tables of attributes. Is it better to group the attributes into tables of 20-30 columns, requiring more joins on average but fewer columns per SELECT, or have tables with the maximum number of columns to minimize the number of joins, but having potentially all 10K columns joined at once?

I also thought of using one giant 3-column customerID-attribute-value table and storing all the info there, but it would be harder to structure a "select all customers with these attributes-type query that I need."

I'm using MySQL 5.0+, but I assume this is a general SQL-ish question.

Upvotes: 2

Views: 2142

Answers (3)

Sebas
Sebas

Reputation: 21532

As @odiszapc said, you have to use a meta-model structure, like for instance:

CREATE TABLE customer(ID INT NOT NULL PRIMARY KEY, NAME VARCHAR(64));
CREATE TABLE customer_attribute(ID INT NOT NULL, ID_CUSTOMER INT NOT NULL, NAME VARCHAR(64), VALUE VARCHAR(1024));

Return basic informations of given customer:

SELECT * FROM customers WHERE name='John';

Return customer(s) matching certain attributes:

SELECT c.* 
FROM customer c 
    INNER JOIN attribute a1 ON a1.id_customer = c.id 
                           AND a1.name = 'address' 
                           AND a1.value = '1078, c/ los gatos madrileños'
    INNER JOIN attribute a2 ON a2.id_customer = c.id 
                           AND a2.name = 'age' 
                           AND a2.value = '27'

Your generator should generate the inner joins on the fly.

Proper indexes on the tables should allow all this engine to go relatively fast (if we assume 10k attributes per customer, and 10k customers, that's actually pretty much a challenge...)

Upvotes: 2

codingbiz
codingbiz

Reputation: 26386

10,000 columns is much. The SELECT statement will be very long and messy if you wouldn't use *. I think you can narrow the attributes down to most useful and meaningful ones, eliminating others

Upvotes: 1

odiszapc
odiszapc

Reputation: 4109

From my expirience using tables with 10,000 columns is very-very-very bad idea. What if in future this number will be increased?

If there are a lot of attributes you shouldn't use a horizontal scaled tables (with large number of columns). You should create a new table attributes and place alltributes values into it. Then connect this table with Many-To-One relationship to main entry table

Maybe the second way is to use no-SQL (like MongoDB) systems

Upvotes: 5

Related Questions