Stefan Kunze

Reputation: 751

Handling large iterators - aggregation

  1. Let's say we have an Iterator of a (String, String)-Tuple.
  2. Said Iterator has a kazillion elements, possibly exhausting main memory.

What would you do if you had to aggregate it like the following:

The tuples have the form (entityname, attributename), and for each entityname you would have to populate a list of its attributenames. Also, the iterator is completely unordered and would never fit into memory.

(for instance, the last and the first attributename could correspond to the same entityname).

A concrete example:

("stackoverflow","users"),
("bear","claws"),
("stackoverflow","usesAjaxTechnology"),
("bear","eyes") 

after the aggregation -> :

("stackoverflow",List("users","usesAjaxTechnology")),
("bear",List("claws","eyes")).

I know that there are operations like groupBy and so on, but assuming the iterator has a lot of elements, wouldn't they never work due to memory problems?

Upvotes: 0

Views: 88

Answers (2)

barnesjon

Reputation: 39

If you have that many elements, I would assume that they are in a database or a file of some type. I would take them out in manageable chunks and process them that way, writing the results back to a database or a new file. That would solve your memory problem and still allow you to perform this kind of processing.

If you are using MongoDB (which I recommend), your find query could easily pull out only the stackoverflow tuples, and your next statement could write them to a new collection. The same goes for the bear tuples.
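The chunking idea can be sketched independently of any particular database. Here is a minimal version that pulls fixed-size chunks off the iterator, aggregates each chunk in memory, and hands the partial result to a `persist` callback; `persist` is a stand-in I made up for whatever store (database collection, file) you write to, and that store must merge partials arriving for the same key:

```scala
// Sketch: process a huge Iterator[(String, String)] in fixed-size chunks.
// Only one chunk is held in memory at a time; each partial aggregate is
// flushed to external storage via the caller-supplied `persist` function.
def aggregateInChunks(
    pairs: Iterator[(String, String)],
    chunkSize: Int
)(persist: Map[String, List[String]] => Unit): Unit =
  pairs.grouped(chunkSize).foreach { chunk =>
    val partial = chunk.groupBy(_._1).map { case (entity, tuples) =>
      entity -> tuples.map(_._2).toList
    }
    persist(partial) // the store must append to lists for already-seen keys
  }
```

Note that the external store does the real work here: because the iterator is unordered, the same entityname can show up in many chunks, so `persist` must append rather than overwrite.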

Upvotes: 0

fresskoma

Reputation: 25781

Well, let's take a look at what groupBy does:

scala> val res0 = List(("stackoverflow","users"), ("bear","claws"), ("stackoverflow","usesAjaxTechnology"), ("bear","eyes"))
res0: List[(String, String)] = List((stackoverflow,users), (bear,claws), (stackoverflow,usesAjaxTechnology), (bear,eyes))

scala> res0.groupBy(x => x._1)
res2: scala.collection.immutable.Map[String,List[(String, String)]] = 
    Map( bear -> List((bear,claws), (bear,eyes)),
         stackoverflow -> List((stackoverflow,users), (stackoverflow,usesAjaxTechnology))
    )

As you can see, it creates a Map of elements. Since it does so in memory, you'd obviously run into memory issues as the data grows larger than your RAM.

On the other hand, it'd be possible to construct a Map-like structure that, instead of holding all data in memory, writes them to the filesystem. The simplest such Map would create a file for every key (e.g. "bear" or "stackoverflow") in a certain directory, and write all the attributes into the corresponding files. This would require almost no memory usage, replacing it with very high disk usage.
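A minimal sketch of that file-per-key idea (the directory layout and helper names are my own assumptions, and opening a writer per element is deliberately naive):

```scala
import java.io.{File, FileWriter}

// Sketch: "group by" on disk. Every distinct entity name becomes a file in
// `dir`; attribute names are appended to that file one line at a time, so
// memory usage stays constant no matter how long the iterator is.
def groupToDisk(pairs: Iterator[(String, String)], dir: File): Unit = {
  dir.mkdirs()
  pairs.foreach { case (entity, attribute) =>
    val out = new FileWriter(new File(dir, entity), true) // true = append
    try out.write(attribute + "\n")
    finally out.close()
  }
}

// Reading one entity's attributes back only loads that single file.
def attributesOf(entity: String, dir: File): List[String] =
  scala.io.Source.fromFile(new File(dir, entity)).getLines().toList
```

Opening and closing a `FileWriter` for every tuple is very slow; a real implementation would keep a bounded cache of open writers or buffer appends per key, which is exactly the "very high disk usage" trade-off described above.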

I'm wondering if this is sort of an artificial requirement, or if you are actually facing a real problem where this is an issue. Also, I'm really interested in hearing what the real functional programming pros here have to say :)

Upvotes: 1
