Torcione

Reputation: 845

Programming object for processing large lists of data

I recently had to perform a cross-selection operation on several collections, to find an output collection matching my criteria. (I will omit the custom logic because it is not relevant here.)

What I did was create a class that took Lists of elements as parameters, and then call a function on that class that was responsible for processing those lists and returning a value.

The point is, I'm convinced I'm not doing the right thing, because writing a class that holds hundreds of elements, takes lists as parameters, and returns another collection looks unconventional and awkward.

Is there a specific programming construct or paradigm for processing large numbers of large collections, possibly with fairly heavy custom selection/mapping logic?

I'm building for Android using Kotlin.

Upvotes: 0

Views: 646

Answers (1)

Manushin Igor

Reputation: 3689

First of all, when we talk about performance, there is only one right answer: write a benchmark and test.

About memory: a list of 1,000,000 unique Strings with an average length of 30 characters takes about 120 MB (i.e. 10^6 * 30 * 4, where the last factor is the size of one character, assuming a 4-byte Unicode character). Add another 1-3% for overhead such as object references. Therefore: if your data is on this scale or smaller, just load it all into memory and use a list, because this is the fastest and simplest solution (synchronous, immutable, etc.).
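
For this simple in-memory case, a minimal sketch (the file path and selection predicate are hypothetical placeholders):

import java.io.File

// Eager approach: read everything into a list, then filter in memory.
// Reasonable as long as the data fits comfortably in the heap, per the
// estimate above.
fun selectMatching(path: String, criteria: (String) -> Boolean): List<String> {
    val allLines: List<String> = File(path).readLines() // whole file in memory
    return allLines.filter(criteria)
}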

If you can express your processing as streaming operations, you can use sequences. They are lazy, similar to Java Streams and .NET LINQ. The example below requires only a small amount of memory.

import java.io.File

// Counts positions at which both files contain the same line.
// useLines streams each file lazily and closes the readers afterwards.
fun countOfEqualLinesOnTheSamePositions(path1: String, path2: String): Int {
    return File(path1).useLines { lines1 ->
        File(path2).useLines { lines2 ->
            lines1.zip(lines2)
                .count { (line1, line2) -> line1 == line2 }
        }
    }
}
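
For example (file names are placeholders):

val equal = countOfEqualLinesOnTheSamePositions("old.txt", "new.txt")
println("$equal lines are identical and at the same positions")

Both files are read lazily, line by line, so memory usage stays roughly constant regardless of file size.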

If you can't fit the whole data set in memory and can't process it with a stream-like scheme, you may:

  • Rework the algorithm from single-pass to multiple-pass, where each pass is stream-like (see the sketch after this list). For example, Huffman coding is a two-pass algorithm, so it can compress 1 TB of data using only a small amount of memory.
  • Store intermediate data on disk (this is too complex to cover in this short answer).
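
A minimal two-pass sketch (the statistic is arbitrary, chosen only to show the two-pass shape; each pass streams the file with constant memory):

import java.io.File

// Pass 1: compute the average line length (a streaming aggregate).
// Pass 2: count lines longer than that average, using the pass-1 result.
fun countLinesLongerThanAverage(path: String): Int {
    var total = 0L
    var count = 0L
    File(path).useLines { lines ->
        for (line in lines) {
            total += line.length
            count++
        }
    }
    if (count == 0L) return 0
    val average = total.toDouble() / count

    return File(path).useLines { lines ->
        lines.count { it.length > average }
    }
}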

For additional optimizations:

  • To merge many parallel streams, also consider Kotlin Flow. It allows you to work asynchronously and avoid blocking on IO. For example, this can be useful for merging ~100 network streams (see the first sketch after this list).
  • To keep many non-unique items in memory, consider caching/interning logic so that identical values are stored only once. It can save memory (but please benchmark first); see the second sketch after this list.
  • Try operating on ByteBuffers instead of Strings. You can allocate far less (because you can reuse buffers explicitly), but the code becomes significantly more complex.
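
A minimal Kotlin Flow merge sketch (fetchLines and the URLs are hypothetical stand-ins for real network sources):

import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Stand-in for a real asynchronous network source.
fun fetchLines(url: String): Flow<String> = flow {
    emit("line from $url") // ...emit lines as they arrive...
}

fun main() = runBlocking {
    val urls = List(100) { "https://example.com/stream/$it" }
    urls.map { fetchLines(it) }
        .merge()                    // interleave all flows as items arrive
        .filter { it.isNotBlank() } // custom selection logic goes here
        .collect { println(it) }
}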
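
And a minimal interning cache for the non-unique-items case (getOrPut returns the stored instance, so identical strings are kept in memory only once):

// Each distinct string is stored once; duplicates share the same instance.
class Interner {
    private val cache = HashMap<String, String>()
    fun intern(value: String): String = cache.getOrPut(value) { value }
}

(java.lang.String.intern() exists as well, but a local map keeps the cache's lifetime under your control.)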

Upvotes: 2
