Jai

Reputation: 3609

What is meant by sparse data/datastore/database?

I have been reading up on Hadoop and HBase lately, and came across this term:

HBase is an open-source, distributed, sparse, column-oriented store...

What do they mean by sparse? Does it have something to do with a sparse matrix? I am guessing it is a property of the type of data it can store efficiently, and hence, would like to know more about it.

Upvotes: 22

Views: 33509

Answers (5)

Kh.Taheri

Reputation: 955

The best article I have seen on this, which explains many database terms as well:

> http://jimbojw.com/#understanding%20hbase

Upvotes: 1

Ashish Singh

Reputation: 51

There are two ways of storing data in tables: as sparse data or as dense data. Here is an example of sparse data.

Suppose we have to run an operation on a table containing sales data for transactions by employee between Jan 2015 and Nov 2015. After running the query, we get the data that satisfies the date condition; if an employee did not make any transactions, the whole row comes back blank.

e.g.

 EMPNo  Name  Product  Date        Quantity
 1234   Mike  Hbase    2014/12/01  1
 5678
 3454   Jole  Flume    2015/09/12  3

The row with EMPNo 5678 has no data, while the rest of the rows contain data. If we consider the whole table, with both blank and populated rows, we can term it sparse data.

If we take only the populated rows, it is termed dense data.

Upvotes: 1

David

Reputation: 3261

At the storage level, all data is stored as a key-value pair. Each storage file contains an index so that it knows where each key-value starts and how long it is.

As a consequence of this, if you have very long keys (e.g. a full URL), and a lot of columns associated with that key, you could be wasting some space. This is ameliorated somewhat by turning compression on.
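To make the key-repetition cost concrete, here is a minimal sketch. It is not HBase's actual on-disk format; the helper names `cell_entries` and `key_bytes` and the example URL are hypothetical. It only assumes, as described above, that each stored cell becomes its own key-value entry carrying the full row key.

```python
# Hypothetical illustration only: this is not HBase's real file format.

def cell_entries(row_key, columns):
    """Flatten one logical row into one (key, value) entry per non-empty column."""
    return [
        ((row_key, name), value)
        for name, value in columns.items()
        if value is not None  # empty cells produce no entry at all
    ]


def key_bytes(entries):
    """Total bytes spent on row keys across all entries (UTF-8 encoded)."""
    return sum(len(row_key.encode("utf-8")) for (row_key, _), _ in entries)


row_key = "http://www.example.com/some/long/path/index.html"
columns = {"title": "Home", "lang": "en", "size": "1024", "author": None}

entries = cell_entries(row_key, columns)
print(len(entries))        # one entry per stored cell; the None cell is skipped
print(key_bytes(entries))  # the row key is counted once per stored cell
```

The long key is paid for once per stored column, which is the waste mentioned above, while the empty `author` cell costs nothing.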

See http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html for more information on HBase storage.

Upvotes: 4

Peter Wone

Reputation: 18805

In a regular database, rows are sparse but columns are not. When a row is created, storage is allocated for every column, irrespective of whether a value exists for that field (a field being the storage allocated for the intersection of a row and a column).

This allows fixed-length rows, which greatly improves read and write times. Variable-length data types are handled with an analogue of pointers.
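As a rough sketch of why fixed-length rows are fast to read, the following uses Python's `struct` module with a made-up row layout (an int id, a float amount, and a one-byte null flag). The layout and the names `pack_row` and `read_row` are assumptions for illustration, not how any particular database lays out its pages.

```python
import struct

# Hypothetical row layout: 4-byte int id, 8-byte float amount, 1-byte null flags.
# "<" selects standard sizes with no padding, so every row is 13 bytes.
ROW = struct.Struct("<idB")


def pack_row(emp_id, amount):
    """Pack one row; a missing amount still occupies its 8 bytes."""
    null_flags = 1 if amount is None else 0
    return ROW.pack(emp_id, 0.0 if amount is None else amount, null_flags)


def read_row(buf, index):
    """Jump straight to row `index`, since every row has the same size."""
    emp_id, amount, null_flags = ROW.unpack_from(buf, index * ROW.size)
    return emp_id, (None if null_flags & 1 else amount)


table = b"".join(
    pack_row(emp_id, amount)
    for emp_id, amount in [(1234, 1.0), (5678, None), (3454, 3.0)]
)

print(ROW.size)            # bytes per row, identical for every row
print(read_row(table, 1))  # the row with the NULL amount
```

Locating a row is a single multiplication, at the price of reserving space for the NULL field as well.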

Sparse columns incur a performance penalty and are unlikely to save much disk space, because the space required to indicate NULL is smaller than the 64-bit pointer required by the linked-list style of chained-pointer architecture typically used to implement very large non-contiguous storage.

Storage is cheap. Performance isn't.

Upvotes: 27

Donald Miner

Reputation: 39943

Sparse with respect to HBase is indeed used in the same sense as in a sparse matrix. It basically means that fields that are null cost nothing to store (in terms of space).
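A toy sketch of the idea, assuming a store that keeps only the cells that actually hold a value, keyed by (row, column). The `SparseTable` class is hypothetical and has nothing to do with the HBase client API.

```python
class SparseTable:
    """Toy store that keeps only cells with a value, keyed by (row, column)."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, value):
        self._cells[(row, column)] = value

    def get(self, row, column):
        # A cell that was never written simply has no entry.
        return self._cells.get((row, column))

    def stored_cells(self):
        return len(self._cells)


table = SparseTable()
table.put("row1", "name", "Mike")
table.put("row1", "product", "Hbase")
table.put("row2", "name", "Jole")

print(table.stored_cells())          # only the cells that were written
print(table.get("row2", "product"))  # missing cell reads back as None
```

However many distinct columns exist across all rows, storage grows only with the number of cells written, much as a sparse matrix stores only its non-zero entries.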

I found a couple of blog posts that touch on this subject in a bit more detail:

http://blog.rapleaf.com/dev/2008/03/11/matching-impedance-when-to-use-hbase/

http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Upvotes: 5
