Reputation: 591
I have a single text file that is about 500GB (i.e. a very large log file) and would like to build an implementation to search it quickly.
So far I have created my own inverted index with a SQLite Database but this doesn't scale well enough.
Can anyone suggest a fairly simple implementation that would allow quick searching of this massive document?
I have looked at Solr and Lucene, but these look too complicated for a quick solution. I'm thinking a database with built-in full-text indexing (MySQL, Raven, Mongo etc.) may be the simplest solution, but I have no experience with this.
Upvotes: 0
Views: 406
Reputation: 1106
Convert the log file to CSV, then import the CSV into MySQL, MongoDB, etc.
MongoDB:
For help:
mongoimport --help
JSON file:
mongoimport --db db --collection collection --file collection.json
CSV file:
mongoimport --db db --collection collection --type csv --headerline --file collection.csv
Use the “--ignoreBlanks” option to ignore blank fields. For CSV and TSV imports, this option provides the desired functionality in most cases: it avoids inserting blank fields in MongoDB documents.
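For example, the CSV import above with blank fields skipped would be:
mongoimport --db db --collection collection --type csv --headerline --ignoreBlanks --file collection.csv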
Guides: mongoimport, mongoimport v2.2
Then define an index on the collection and enjoy :-)
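As a rough sketch of that last step, assuming the log line ends up in a field named message (the actual field name depends on your CSV header) and a MongoDB version with text search (2.6 or later), a text index and query in the mongo shell could look like:

// hypothetical field name "message"; adjust to match your --headerline
db.collection.createIndex( { message: "text" } )

// search the text index for documents containing "error"
db.collection.find( { $text: { $search: "error" } } )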
Upvotes: 0
Reputation: 27515
Since you are looking at text processing for log files I'd take a close look at the Elasticsearch, Logstash, Kibana (ELK) stack. Elasticsearch provides the Lucene-based text search. Logstash parses and loads the log file into Elasticsearch. And Kibana provides a visualization and query tool for searching and analyzing the data.
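To give a feel for the Logstash side, a minimal sketch of a config that reads a log file into a local Elasticsearch node might look like this (the file path is hypothetical, and output options such as hosts vary between Logstash versions):

input {
  file {
    path => "/var/log/big.log"        # hypothetical path to your 500GB log file
    start_position => "beginning"     # read the existing file from the start, not just new lines
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]       # local Elasticsearch node
  }
}

Once the data is indexed you can search it from Kibana, or with a simple query such as curl 'localhost:9200/logstash-*/_search?q=message:error&pretty' (Logstash writes daily logstash-* indices by default).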
This is a good webinar on the ELK stack by one of their trainers: http://www.elasticsearch.org/webinars/elk-stack-devops-environment/
As an experienced MongoDB, Solr and Elasticsearch user I was impressed by how easy it was to get all three components up and functional analyzing log data. It also has a robust user community, both here on Stack Overflow and elsewhere.
You can download it here: http://www.elasticsearch.org/overview/elkdownloads/
Upvotes: 1