Reputation:
I have a ~200 GB dataset of approximately 1.5 billion observations, on which I need to run some conditional analysis and data aggregation*.
The thing is that I'm not used to (nor trained to handle) large datasets. I usually work in R or Python (with some Julia on the side), and I am completely lost when I just can't fit the dataset into memory.
How do people handle these datasets that fit on disk but not in memory? Where should I start looking for solutions? Is there a place where information on large (yet not quite "big data") datasets is centralized?
*Long story short: I have another dataset (which fits into memory), and for each row of this small dataset I want to count the number of observations in the large dataset that match some conditions from the small dataset. My initial reaction was to run the code in chunks, but this is very inefficient and would take centuries of single-processor computing time.
Since it has been specifically asked, I will describe the structure of my file.
I have a big file, let us call it BIG, with (notably) two ID variables, $ID0$ and $ID1$ and a date variable $date1$.
I have a small file, let us call it SMALL, with two ID variables, $ID2$ and $ID3$, and a date variable $date2$.
For each $ID2_i$, I want to count all observations in BIG such that $ID0 = ID2_i$, $date1 < date2_i$, and $ID1 \in \{ID2_j : ID3_j = ID3_i \text{ and } date2_j < date2_i\}$.
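For concreteness, here is a rough sketch of the in-memory version of this count in Python/pandas (assuming BIG and SMALL are loaded as data frames big and small with the column names above). This is the computation I need to scale; it obviously does not work when BIG does not fit in memory:
import pandas as pd

def count_matches(big: pd.DataFrame, small: pd.DataFrame) -> pd.Series:
    counts = []
    for _, row in small.iterrows():
        # the IDs ID2_j of the SMALL rows j with ID3_j = ID3_i and date2_j < date2_i
        allowed_id1 = small.loc[
            (small["ID3"] == row["ID3"]) & (small["date2"] < row["date2"]), "ID2"
        ]
        # count BIG observations with ID0 = ID2_i, date1 < date2_i and ID1 in that set
        mask = (
            (big["ID0"] == row["ID2"])
            & (big["date1"] < row["date2"])
            & (big["ID1"].isin(allowed_id1))
        )
        counts.append(mask.sum())
    return pd.Series(counts, index=small.index)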
Upvotes: 6
Views: 3205
Reputation: 692
I might be misunderstanding your problem, but chunking the big file (as already suggested in the comments) seems to me the most straightforward approach.
Say you divide the 200 GB file into 100 chunks; you then iterate over the chunks, do the desired counting for each chunk, and aggregate the results. If the per-chunk operation runs in minutes, you should be fine, unless you want to do this over and over again.
For more concrete suggestions, I'd need to know a bit more about the data storage format. Are we talking about a big .csv file? In that case, for R you might look into the chunked API of the readr package. For doing the counting as quickly as possible, again in R, the data.table package might come in handy.
Edit: Adding some example code
This won't do exactly what you requested, but hopefully covers some of the key points to make a solution as I suggested work.
library(data.table)
library(readr)

ids <- seq.int(1, 1e2)
dates <- seq(as.Date("1999/01/01"), as.Date("2000/01/01"), by = "day")

# simulate a "big" file on disk and a small in-memory table
big <- data.table(id0 = sample(ids, 1e6, replace = TRUE),
                  id1 = sample(ids, 1e6, replace = TRUE),
                  date1 = sample(dates, 1e6, replace = TRUE))
write.csv(big, "big.csv", row.names = FALSE)

small <- data.table(id2 = sample(ids, 1e2),
                    id3 = sample(ids, 1e2),
                    date2 = sample(dates, 1e2))

# callback applied to each chunk: count matches via a non-equi join
# and add them to the running accumulator `acc`
count_fun <- function(x, pos, acc) {
  setDT(x)
  tmp <- small[x, list(counts = .N),
               on = c("id2 == id0", "id3 == id1", "date2 > date1"),
               by = .EACHI, nomatch = NULL]
  # sum per id first, since the same id usually occurs several times per chunk
  tmp <- tmp[, list(counts = sum(counts)), by = id2]
  acc[tmp$id2] <- acc[tmp$id2] + tmp$counts
  acc
}

accumulator <- AccumulateCallback$new(count_fun, acc = rep(0, length(ids)))
counts <- read_csv_chunked("big.csv", accumulator, chunk_size = 1e4)
Upvotes: 3
Reputation: 2251
There are different methods for handling datasets of this size.
Chunking makes many operations, such as shuffling, much easier.
Make sure each subset/chunk is representative of the whole dataset. Each chunk file should have the same number of lines.
This can be done by appending lines to one chunk file after another (round-robin). You will quickly realize that it is inefficient to open each file just to write a single line, especially while reading and writing on the same drive.
-> add read and write buffers that fit into memory.
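A minimal Python sketch of this round-robin splitting with buffered reads and writes (the file names and the number of chunks are just placeholders):
# Split a big CSV into N chunk files round-robin so that each chunk is roughly
# representative of the whole dataset. Buffered line-by-line reading/writing
# keeps memory usage low.
N_CHUNKS = 100

def split_round_robin(path, n_chunks=N_CHUNKS, buffering=1024 * 1024):
    outputs = [
        open(f"chunk_{i:03d}.csv", "w", encoding="utf-8", buffering=buffering)
        for i in range(n_chunks)
    ]
    try:
        with open(path, "r", encoding="utf-8", buffering=buffering) as f:
            header = f.readline()
            for out in outputs:
                out.write(header)
            for i, line in enumerate(f):
                # distribute lines round-robin over the chunk files
                outputs[i % n_chunks].write(line)
    finally:
        for out in outputs:
            out.close()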
Choose a chunk size that fits your needs. I chose mine so that my default text editor can still open a chunk file fairly quickly.
Smaller chunks can boost performance, especially if you want metrics such as the class distribution, because then you only have to loop through one representative file to get an estimate for the whole dataset, which might be enough.
Bigger chunk files give a better representation of the whole dataset in each file, but you could just as well go through several smaller chunk files.
I use C# for this because I am far more experienced there and can therefore use the full feature set, such as splitting the reading / processing / writing tasks onto different threads.
If you are experienced with Python or R, I suspect similar functionality exists there as well. Parallelizing can be a huge factor on such large datasets.
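In Python, for example, a process pool is one way to parallelize the per-chunk work. This is only a rough sketch; process_chunk is a placeholder for whatever counting you do per chunk, and the chunk file pattern is an assumption:
# Process the chunk files on multiple cores with a process pool and
# aggregate the partial results at the end.
from concurrent.futures import ProcessPoolExecutor
from glob import glob

def process_chunk(path):
    # placeholder: stream through the chunk and return a partial result
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)  # e.g. just count lines

if __name__ == "__main__":
    chunk_files = sorted(glob("chunk_*.csv"))
    with ProcessPoolExecutor() as pool:
        partial_results = list(pool.map(process_chunk, chunk_files))
    print(sum(partial_results))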
Chunked datasets can be modeled into one interleaved dataset that you can process with tensor processing units. That would probably yield one of the best performances and can be executed locally as well as in the cloud on really big machines. But it requires a lot of learning on TensorFlow.
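As a minimal sketch (not a full TPU setup), interleaving the chunk files with the tf.data API could look roughly like this; the file pattern and batch size are placeholders, and parsing of the individual lines is left out:
# Interleave the chunk files into one streaming tf.data pipeline.
import tensorflow as tf

files = tf.data.Dataset.list_files("chunk_*.csv", shuffle=True)
dataset = files.interleave(
    lambda path: tf.data.TextLineDataset(path).skip(1),  # skip the header line
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE,
)
dataset = dataset.batch(1024).prefetch(tf.data.AUTOTUNE)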
Instead of doing something like all_of_it = file.read(), you want to use some kind of stream reader.
The following function reads through one of the chunk files (or your whole 200 GB dataset) line by line to count each class within the file. By processing one line at a time, your program will not overflow memory.
You might want to add some progress indication, such as X lines/s or X MB/s, in order to estimate the total processing time (see the small sketch after the function below).
def getClassDistribution(path):
    classes = dict()
    # stream through the file line by line and count the class labels
    # (assumes the label is the single digit at the end of each line)
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        line = f.readline()
        while line:
            line = line.rstrip("\n")
            if line != "":
                label = int(line[-1:])
                classes[label] = classes.get(label, 0) + 1
            line = f.readline()
    return classes
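As mentioned above, a simple lines-per-second progress indicator for such a loop could look roughly like this (sketch only; the reporting interval is arbitrary):
import time

def stream_with_progress(path, report_every=1_000_000):
    start = time.time()
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for n, line in enumerate(f, start=1):
            # ... process the line here (e.g. count its class label) ...
            if n % report_every == 0:
                rate = n / (time.time() - start)
                print(f"{n} lines read, {rate:,.0f} lines/s")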
I use a combination of chunked datasets and estimation.
Pitfalls for performance
cells = int(line.Split(',')[8])
This will very quickly lead to a memory throughput bottleneck, because splitting every line allocates all of its fields as separate strings. A good example of avoiding this is getClassDistribution above, where I only want to get the label. The following C# function splits a CSV line into its elements very fast.
// Call the function; parallelize this over multiple cores/threads for ultimate performance
ThreadPool.QueueUserWorkItem((c) => AnalyzeLine("05.02.2020,12.20,10.13").Wait());

// Splits a CSV line into its fields character by character, without string.Split.
// PriceElement and Observate are defined elsewhere in my code.
private async Task AnalyzeLine(string line)
{
    PriceElement elementToAdd = new PriceElement();
    int counter = 0;
    string temp = "";
    foreach (char c in line)
    {
        if (c == ',')
        {
            switch (counter)
            {
                case 0:
                    elementToAdd.spotTime = DateTime.Parse(temp, CultureInfo.InvariantCulture);
                    break;
                case 1:
                    elementToAdd.buyPrice = decimal.Parse(temp, CultureInfo.InvariantCulture);
                    break;
                case 2:
                    elementToAdd.sellPrice = decimal.Parse(temp, CultureInfo.InvariantCulture);
                    break;
            }
            temp = "";
            counter++;
        }
        else temp += c;
    }
    // the last field has no trailing comma, so flush it here
    if (counter == 2)
        elementToAdd.sellPrice = decimal.Parse(temp, CultureInfo.InvariantCulture);

    // compare the price element to conditions on another thread
    Observate(elementToAdd);
}
When processing CSV-like data, you can also load the data into a database.
Databases are made to accommodate huge amounts of data, and you can expect very high performance.
However, a database will likely use up considerably more disk space than the raw data; this is one reason why I moved away from using a database.
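For illustration only (this is not what I use), loading the CSV into, say, SQLite from Python and letting the database do an aggregation could look roughly like this; the file name, table name, columns, and cutoff date are assumptions based on the question:
# Stream the CSV into SQLite in chunks so memory stays bounded,
# index the join/filter columns, then let the database do the counting.
import sqlite3
import pandas as pd

con = sqlite3.connect("big.db")
for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    chunk.to_sql("big", con, if_exists="append", index=False)
con.execute("CREATE INDEX IF NOT EXISTS idx_big ON big(id0, id1, date1)")
con.commit()

# example aggregation: observations per id0 before some cutoff date
counts = pd.read_sql_query(
    "SELECT id0, COUNT(*) AS n FROM big WHERE date1 < '2000-01-01' GROUP BY id0",
    con,
)
con.close()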
If your code is well optimized, your bottleneck will most likely be the hard drive throughput.
The following screenshot shows a performance comparison of TensorFlow training with the same data on the same machine: one run with the data stored locally on a standard SSD and one run with it on network-attached storage (a normal hard disk) in the local network.
Upvotes: 3
Reputation:
Looks like an O(n × m) problem: each row of SMALL has to be compared against every observation in BIG.
Maybe you can fit all the fields required for the comparison in memory, leaving the rest in the file (see the sketch at the end of this answer). For example: 1.5 billion observations × (4 bytes for the date + 8 bytes for the two IDs) can fit in 18 GB.
Maybe you can sort BIG by date and then your problem becomes O(n log n).
Maybe you can split BIG into chunks where $ID3_i = ID3_j$.
There are lots of possibilities.
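For instance, a rough sketch of the "keep only the required fields in memory" idea in Python/pandas (file and column names assumed from the question) might be:
# Read just the two IDs and the date from BIG, with compact dtypes,
# and keep the rest of the file on disk.
import pandas as pd

big_cols = pd.read_csv(
    "big.csv",
    usecols=["ID0", "ID1", "date1"],
    dtype={"ID0": "int32", "ID1": "int32"},
    parse_dates=["date1"],
)
# 1.5e9 rows x (4 + 4 + 8) bytes is roughly 24 GB (datetime64 takes 8 bytes,
# not 4), so this still needs a machine with a lot of RAM.
print(big_cols.memory_usage(deep=True).sum() / 1e9, "GB")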
Upvotes: 1