Reputation: 1001

Come up with data structure for working with a giant data in a matrix form (think of it as an excel sheet)

The problem is to come up with a data structure that can work with a giant excel sheet (obviously does not fit into the main memory as it is )

Imagine the following as part of excel sheet where e represents an empty cell.

  A B C D ...

1 3 9 e e ...

2 e e e e ...

3 e e 5 e ...

4 e e e e ...

5 e e 6 e ...

So the data structure should allow me to store the excel sheet into the memory (we know that only values in the excel sheet fit into the main memory) and support the following operations

getByColumn(Column col); - gives all values of a certain column, say 5,6 for Column C

getByRow(Row row); - gives all values of a certain row, say 3 and 9 and more for ROW 1

insertCell(Column col, Row row, int value); - inserts or overrides the value of a cell

getExcelSheet(FileName); - gives the whole excel sheet in a compressed form (data structure)

What is a thinkable data structure for this? I am preparing for an interview and this is no homework. WOuld like to gain some insights from different folks.

Just to give a sense: Say the excel sheet is 1 Terabyte, we have 8GB of memory. 1 terabyte excel sheet just have many many empty cells but values spread all over the different cells

Upvotes: 1

Answers (5)

rocky

Reputation: 7098

This is an excellent question that I don't believe has been given the attention and thought here that I think it deserves.

To come up with a data structure, one needs to define or describe some properties of spreadsheets that are desired.

A spreadsheet is thought of as infinite in two directions across the columns and down the rows. That is, at any time you can insert new column or row simply by adding cell data at some row or column index. Most often though, the data is confined to the upper-lefthand corner, or the low-indexed entries.

Because a spreadsheet can grow in two directions, clearly one cannot linearize the two-dimensional array into a one-dimensional array use a index calculation for (row, column) like row * column-length + column as C might do for a two dimensional fixed-length array.

Also note that a typical relational database is not suited here either, because relational databases require that the column lengths are fixed in length. And in a column-oriented database, the rows are fixed in length.

A second property of spreadsheets that comes up is that you sometimes want to access rows or columns as entities. In spreadsheets you can sum across a row or a column. Other row/column operations include duplicating, moving, or deleting them.

As mentioned in other places, some sort of hashing mechanism would work turn $(row, column$ into a linear array is needed. This and many other possibilities are listed in the Wikipedia entry on Sparce Matrices or Jagged Array.

Upvotes: 0

Erik P.

Reputation: 1627

An elaboration of Tass' comment and Mark's answer (for which +1):

You can insert cell values efficiently if you use what wikipedia calls Dictionary Of Keys or DOK (which is essentially Jens' answer), but as you rightly comment, getByRow and getByColumn will be fairly slow.

A better option would be what wikipedia calls Coordinate List or COO: just a set of triples (rowindex, columnindex, value). You'd probably actually store this as three arrays. In order to make insertion fast, keep a set of sorted and unsorted entries, and insert into the unsorted set; whenever the number of unsorted entries goes over a threshold T (which might depend on the total number of nonempty cells K), sort them into the sorted set.

You'll want to sort them all by, say, row index, and keep another array with indices into the arrays to give the version that is sorted by column index.

For getByRow you would take the correct section of the arrays sorted by row index, and additionally search through the unsorted set.

All of this assumes that you do have enough memory to store a couple of words for every nonempty entry in the matrix. If not, you'll need to combine this with some sort of external memory approach.

Upvotes: 1

High Performance Mark

Reputation: 78364

There is an extensive literature on the topic of sparse matrices, which is a widely-used term for what you call a giant Excel sheet. The literature covers both data structures and suitable algorithms for creating and modifying them; the Wikipedia article provides a good starting point for your research. It may tell you enough to prepare yourself for your interview.

Upvotes: 1

Andrew Walters

Reputation: 4803

You could store this magic excel sheet in a two dimensional array, with empty cells containing null. If the data won't fit in that either I think we're out of luck

Upvotes: -2

Jens Schauder

Reputation: 82008

Use a Map/Dictionary mapping cell coordinates to values, returning a default value of EMPTY_CELL for everything not explicitely set.

Implement the desired methods based on that.

Upvotes: 1

Come up with data structure for working with a giant data in a matrix form (think of it as an excel sheet)

Answers (5)

Related Questions