thetna

Reputation: 7143

Using subList to partition a data set

I have a list of lists whose size reaches into the hundreds of millions. Let's say each inner list is a sentence of a text. I would like to partition this data for processing in different threads. I used subList to split the data and hand each part to a different thread for processing. Is this a standard approach for partitioning data? If not, could you please suggest a standard approach?

Upvotes: 0

Views: 135

Answers (1)

Harold L

Reputation: 5264

This will work as long as you do not "structurally modify" (add to or remove from) the list or any of these sub-lists. subList returns a view backed by the original list rather than a copy, so read-only processing is fine.
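For illustration, here is a minimal sketch of that read-only pattern. The class name, the `partition` helper, and the word-counting task are all made up for the example (the question doesn't say what the per-sentence processing is), and I'm assuming a fixed thread pool with one chunk per thread:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SubListPartition {

    // Hypothetical helper: split `data` into at most `parts` contiguous
    // subList views. No elements are copied.
    static <T> List<List<T>> partition(List<T> data, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        int size = data.size();
        int chunkSize = (size + parts - 1) / parts; // ceiling division
        for (int from = 0; from < size; from += chunkSize) {
            chunks.add(data.subList(from, Math.min(size, from + chunkSize)));
        }
        return chunks;
    }

    // Hypothetical read-only task: count the words across all sentences,
    // one chunk per worker thread.
    static long countWords(List<List<String>> sentences, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (List<List<String>> chunk : partition(sentences, threads)) {
                Callable<Long> task = () -> {
                    long n = 0;
                    for (List<String> sentence : chunk) {
                        n += sentence.size(); // reads only, never modifies
                    }
                    return n;
                };
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get(); // wait for each chunk's partial result
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f"));
        System.out.println(countWords(sentences, 2));
    }
}
```

Each worker returns its own partial result and the caller combines them, so the threads never write to shared state and the backing list is never touched while the sub-lists are in use.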

There are many other "big data" approaches to handling 100s of millions of records, because there are other problems you might hit:

  • If your program fails (e.g. OutOfMemoryError), you probably don't want to have to start over from the beginning.
  • You might want to throw more than one machine at the problem, at which point you can't share the data within a single JVM's memory.
  • After you've processed each sentence, are you building some intermediate result and then processing that as a step 2? You may need to put together a pipeline of steps where you re-partition the data before each step.
  • You might find you have too many sentences to fit them all into memory at once.

A really common tool for this kind of work is Hadoop. You'd copy the data into HDFS, run a map-reduce job (or more than one job) on the data and then copy the data out of HDFS when you're done.

An approach that is simpler to implement is to use a database: assign each thread a different range of an integer sentence_id column, and have the threads build your output in another table.
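As a rough sketch of the range assignment only, assuming the ids are reasonably dense between a known minimum and maximum: the class name, the `split` helper, and the `sentences` table name in the comment are hypothetical, and only `sentence_id` comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

public class IdRanges {

    // Hypothetical helper: split the inclusive id range [minId, maxId] into
    // at most `workers` contiguous inclusive ranges, each returned as {lo, hi}.
    static List<long[]> split(long minId, long maxId, int workers) {
        if (workers <= 0 || maxId < minId) {
            throw new IllegalArgumentException("bad range or worker count");
        }
        List<long[]> ranges = new ArrayList<>();
        long total = maxId - minId + 1;
        long step = (total + workers - 1) / workers; // ceiling division
        for (long lo = minId; lo <= maxId; lo += step) {
            ranges.add(new long[] {lo, Math.min(maxId, lo + step - 1)});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Each worker would then run a query over its own range, e.g.
        // SELECT ... FROM sentences WHERE sentence_id BETWEEN ? AND ?
        // ("sentences" is an assumed table name).
        for (long[] r : split(1, 10, 3)) {
            System.out.println(r[0] + " to " + r[1]);
        }
    }
}
```

Because the ranges do not overlap, the workers need no coordination beyond writing their results to the output table, and a failed range can be rerun on its own.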

Upvotes: 2
