Torcione

Reputation: 845

Programming object for processing large lists of data

I recently had to perform a cross-selection operation on several collections, to find an output collection matching my criteria. (I will omit the custom logic because it is not relevant here.)

What I did was create a class that took Lists of elements as parameters, and then call a function on that class that was responsible for processing those lists and returning a value.

The point is, I'm convinced I'm not doing the right thing, because writing a class that holds hundreds of elements, takes lists as parameters, and returns another collection looks unconventional and awkward.

Is there a specific programming construct or paradigm for processing large numbers of large collections, possibly with fairly heavy custom selection/mapping logic?

I'm building for Android using Kotlin.

Upvotes: 0

Views: 646

Answers (1)

Manushin Igor

Reputation: 3689

First of all, when we talk about performance, there is only one right answer: write a benchmark and test.

About memory: a list of 1,000,000 unique Strings with an average length of 30 characters takes about 120 MB (i.e. 10^6 * 30 * 4, where the last factor is the size of one character, assuming a 4-byte Unicode character). Add another 1-3% for overhead such as object references. Therefore: if your data is on this scale or smaller, just load it all into memory and use a list, because this is the fastest and simplest solution (synchronous, immutable, etc.).
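
For this simple in-memory case, a minimal sketch (the file path and selection predicate are hypothetical placeholders):

import java.io.File

// Eager approach: read everything into a list, then filter in memory.
// Reasonable as long as the data fits comfortably in the heap, per the
// estimate above.
fun selectMatching(path: String, criteria: (String) -> Boolean): List<String> {
    val allLines: List<String> = File(path).readLines() // whole file in memory
    return allLines.filter(criteria)
}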

If you can express your processing as streaming operations, you can use sequences. They are lazy, similar to Java Streams and .NET LINQ. The example below requires only a small amount of memory.

import java.io.File

// Counts positions at which both files contain the same line.
// useLines streams each file lazily and closes the readers afterwards.
fun countOfEqualLinesOnTheSamePositions(path1: String, path2: String): Int {
    return File(path1).useLines { lines1 ->
        File(path2).useLines { lines2 ->
            lines1.zip(lines2)
                .count { (line1, line2) -> line1 == line2 }
        }
    }
}
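
For example (file names are placeholders):

val equal = countOfEqualLinesOnTheSamePositions("old.txt", "new.txt")
println("$equal lines are identical and at the same positions")

Both files are read lazily, line by line, so memory usage stays roughly constant regardless of file size.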

If you can't fit the whole data set in memory and can't process it with a stream-like scheme, you may:

  • Rework the algorithm from single-pass to multiple-pass, where each pass is stream-like (see the sketch after this list). For example, Huffman coding is a two-pass algorithm, so it can compress 1 TB of data using only a small amount of memory.
  • Store intermediate data on disk (this is too complex to cover in this short answer).
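
A minimal two-pass sketch (the statistic is arbitrary, chosen only to show the two-pass shape; each pass streams the file with constant memory):

import java.io.File

// Pass 1: compute the average line length (a streaming aggregate).
// Pass 2: count lines longer than that average, using the pass-1 result.
fun countLinesLongerThanAverage(path: String): Int {
    var total = 0L
    var count = 0L
    File(path).useLines { lines ->
        for (line in lines) {
            total += line.length
            count++
        }
    }
    if (count == 0L) return 0
    val average = total.toDouble() / count

    return File(path).useLines { lines ->
        lines.count { it.length > average }
    }
}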

For additional optimizations:

  • To merge many parallel streams, also consider Kotlin Flow. It allows you to work asynchronously and avoid blocking on IO. For example, this can be useful for merging ~100 network streams (see the first sketch after this list).
  • To keep many non-unique items in memory, consider caching/interning logic so that identical values are stored only once. It can save memory (but please benchmark first); see the second sketch after this list.
  • Try operating on ByteBuffers instead of Strings. You can allocate far less (because you can reuse buffers explicitly), but the code becomes significantly more complex.
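
A minimal Kotlin Flow merge sketch (fetchLines and the URLs are hypothetical stand-ins for real network sources):

import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Stand-in for a real asynchronous network source.
fun fetchLines(url: String): Flow<String> = flow {
    emit("line from $url") // ...emit lines as they arrive...
}

fun main() = runBlocking {
    val urls = List(100) { "https://example.com/stream/$it" }
    urls.map { fetchLines(it) }
        .merge()                    // interleave all flows as items arrive
        .filter { it.isNotBlank() } // custom selection logic goes here
        .collect { println(it) }
}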
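
And a minimal interning cache for the non-unique-items case (getOrPut returns the stored instance, so identical strings are kept in memory only once):

// Each distinct string is stored once; duplicates share the same instance.
class Interner {
    private val cache = HashMap<String, String>()
    fun intern(value: String): String = cache.getOrPut(value) { value }
}

(java.lang.String.intern() exists as well, but a local map keeps the cache's lifetime under your control.)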

Upvotes: 2
