dasman

Reputation: 311

parallelizing string matching

I have to mine a large number of datasets and wanted to know if it's better to get a desktop with a GPU or to spread the workload over several separate machines.

I think with a GPU I may have to write my own code using something like the CUDA toolkit.

The number of strings on which I have to perform a regex search is on the order of millions, and I have to match against around 10k different keywords, so it's roughly ~50 billion pattern matches. I want to spread the workload so that, say, a million strings can be handled on one core, and so on.
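On a single multi-core machine, the per-core split described above could be sketched roughly like this (the keyword list here is a hypothetical placeholder; combining all keywords into one alternation lets the regex engine scan each string once instead of once per keyword):

```python
import re
from multiprocessing import Pool

# Hypothetical sample keywords; the real workload would load ~10k of them.
KEYWORDS = ["error", "timeout", "fail"]

# One combined alternation: each string is scanned once for all keywords.
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS))

def count_matches(chunk):
    """Count keyword occurrences in one chunk of strings (one core's share)."""
    return sum(len(PATTERN.findall(s)) for s in chunk)

def parallel_match(strings, workers=4):
    """Split the strings into roughly equal chunks and match them in parallel."""
    size = max(1, len(strings) // workers)
    chunks = [strings[i:i + size] for i in range(0, len(strings), size)]
    with Pool(workers) as pool:
        return sum(pool.map(count_matches, chunks))
```

This only scales to the cores of one box, of course; past that point you need multiple machines, which is where the answer below comes in.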

Any suggestions would help.

Upvotes: 0

Views: 158

Answers (1)

18bytes

Reputation: 6029

Since you want to process a large dataset, Hadoop might be a solution. Hadoop implements the MapReduce model (originally described by Google). With Hadoop you can split your task into multiple sub-parts and let an individual machine process each part.
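A minimal sketch of the MapReduce shape for this job (the keyword list is a hypothetical stand-in; with Hadoop Streaming you would wrap these functions to read lines from stdin and print tab-separated key/value pairs, and the framework handles the shuffle between them):

```python
import re
from collections import defaultdict

# Hypothetical keyword set; the real job would load the ~10k keywords.
KEYWORDS = ["error", "timeout"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS))

def mapper(line):
    """Map phase: emit (keyword, 1) for every keyword hit in one input line."""
    for match in PATTERN.findall(line):
        yield (match, 1)

def reducer(pairs):
    """Reduce phase: sum the counts per keyword, as Hadoop's reduce would."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)
```

Each mapper runs on its own slice of the input strings on its own node, so the 50 billion matches get spread across the cluster automatically.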

The size you mentioned (50 billion matches) can be processed using a cluster of Hadoop nodes. If you do not have many machines, you can rent them from Amazon via Elastic MapReduce.

http://aws.amazon.com/elasticmapreduce/

http://hadoop.apache.org/

Upvotes: 1
