Yana

Reputation: 975

Data profiling of columns for big table (SQL Server)

I have a table with over 40 million records. I need to profile its data, including null counts, distinct values, zeros and blanks, % numeric, % date, values that need trimming, etc.

The examples I was able to find always implement the task with cursors. For a big table, such a solution is a performance killer.

I would appreciate suggestions and examples of better-performing alternatives. Is it possible to create multiple stored procedures and combine the results in a table? I have not used stored procedures so far, so my question is based only on what I understood from the documentation.

Upvotes: 0

Views: 750

Answers (1)

J.D.

Reputation: 1023

As Gordon mentioned, you should include your table's schema and some sample data to get the best answers, but here are a few things you can look into:

  1. Columnstore Indexes - These can be helpful for analytical queries against a table, e.g. SUM(), COUNT(), COUNT(DISTINCT), etc. This is because of the compression efficiencies that columnar storage achieves within each column. This is useful if you need a "real time" answer every time you query the data.

  2. You can periodically stage and update the results in a data warehouse type table. You basically store the results of those aggregations in their own table and periodically update it, either with a SQL Agent Job (this isn't necessarily a real time solution) or with triggers that automatically update your data warehouse table (which will be closer to a real time solution but can be performance heavy if not implemented in a lean manner).

  3. OLAP Cubes - This is a more automated version of the above solution and is more maintainable, but it is also a more advanced solution. It is a methodology for building out an actual OLAP-based data warehouse.
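To illustrate option 2, and the stored procedure part of the question, here is a minimal sketch of a results table refreshed by a procedure that a SQL Agent Job could call on a schedule. All object names (`dbo.BigTable`, `CustomerName`, `dbo.ColumnProfile`, `dbo.RefreshColumnProfile`) are hypothetical placeholders, since the question does not show the schema, and only three metrics are covered.

```sql
-- Hypothetical results table: one row per profiled column.
CREATE TABLE dbo.ColumnProfile (
    TableName     sysname      NOT NULL,
    ColumnName    sysname      NOT NULL,
    TotalRows     bigint       NOT NULL,
    NullCount     bigint       NOT NULL,
    DistinctCount bigint       NOT NULL,
    ProfiledAt    datetime2(0) NOT NULL,
    CONSTRAINT PK_ColumnProfile PRIMARY KEY (TableName, ColumnName)
);
GO

-- Hypothetical procedure: recomputes the profile of one column with a
-- single set-based scan (no cursor) and replaces the stored row.
CREATE PROCEDURE dbo.RefreshColumnProfile
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;  -- roll back the whole refresh if any statement fails

    BEGIN TRANSACTION;

    DELETE FROM dbo.ColumnProfile
    WHERE TableName = N'dbo.BigTable'
      AND ColumnName = N'CustomerName';

    INSERT INTO dbo.ColumnProfile
        (TableName, ColumnName, TotalRows, NullCount, DistinctCount, ProfiledAt)
    SELECT
        N'dbo.BigTable',
        N'CustomerName',
        COUNT_BIG(*),
        COUNT_BIG(*) - COUNT_BIG(CustomerName),  -- COUNT(column) skips NULLs
        COUNT_BIG(DISTINCT CustomerName),        -- NULL is not counted as a value
        SYSDATETIME()
    FROM dbo.BigTable;

    COMMIT TRANSACTION;
END;
GO
```

Running `EXEC dbo.RefreshColumnProfile;` from a scheduled job keeps the table current as of the last run. Separate procedures for other columns or tables could write to the same table, which is one way to combine their results in one place.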

In terms of difficulty of implementation, and given the size of your data (which isn't anything too huge), my recommendation would be to start with columnstore indexes and see how much they help your queries. I've had much success using them for analytical querying. Otherwise, my remaining recommendations are listed in order of difficulty as well.
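As a rough sketch of that first step, the following creates a nonclustered columnstore index and computes several of the requested metrics in one set-based pass instead of a cursor. The table and column names (`dbo.BigTable`, `CustomerName`, `Amount`) are hypothetical, both columns are assumed to be non-LOB `varchar`, and the code assumes SQL Server 2016 or later, where a table with a nonclustered columnstore index stays writable. `TRY_CONVERT` needs SQL Server 2012 or later.

```sql
-- Hypothetical index over the columns being profiled.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_BigTable_Profile
    ON dbo.BigTable (CustomerName, Amount);
GO

-- One scan of the table computes every metric via conditional aggregation.
SELECT
    COUNT_BIG(*)                                   AS TotalRows,
    COUNT_BIG(*) - COUNT_BIG(CustomerName)         AS NullCount,
    COUNT_BIG(DISTINCT CustomerName)               AS DistinctCount,
    -- Trailing spaces are ignored by '=', so space-only values count as blank too.
    SUM(CASE WHEN CustomerName = '' THEN 1 ELSE 0 END) AS BlankCount,
    -- DATALENGTH counts bytes including trailing spaces, unlike LEN.
    SUM(CASE WHEN DATALENGTH(CustomerName)
                  <> DATALENGTH(LTRIM(RTRIM(CustomerName)))
             THEN 1 ELSE 0 END)                    AS NeedsTrimCount,
    -- TRY_CONVERT returns NULL instead of raising an error on bad input.
    100.0 * SUM(CASE WHEN TRY_CONVERT(decimal(38, 10), Amount) IS NOT NULL
                     THEN 1 ELSE 0 END)
          / NULLIF(COUNT_BIG(*), 0)                AS PctNumeric,
    100.0 * SUM(CASE WHEN TRY_CONVERT(date, Amount) IS NOT NULL
                     THEN 1 ELSE 0 END)
          / NULLIF(COUNT_BIG(*), 0)                AS PctDate
FROM dbo.BigTable;
```

What counts as "numeric" or "date" here is only what `TRY_CONVERT` accepts for the chosen target types, so the percentages may need adjusting to your own definition (for example, values in scientific notation do not convert to `decimal`).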

Upvotes: 1
