Reputation: 21830
In a perfect world, I'd have a bunch of data readily available to me without any time spent asking for and receiving it. But in the context of real applications, like Google or Facebook, you have a mountain of data stored in a database that takes time to query, and then you're trying to process that data in order to draw meaningful conclusions / relationships.
In the context of counting and sorting lots of data in SQL, you'd store data in summary tables to avoid the processing... and just update those tables with cron. But statistical analysis and NLP seem to be different.
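To make that concrete, here's the kind of summary-table job I mean. This is only a sketch against a hypothetical SQLite `events` table with a `created_at` column; the refresh script would be kicked off by cron.

```python
# Minimal sketch of the summary-table pattern, assuming a hypothetical
# SQLite database with an `events(id, created_at, ...)` table.
import sqlite3

def refresh_daily_counts(db_path="app.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS daily_event_counts (
            day      TEXT PRIMARY KEY,
            n_events INTEGER
        )
    """)
    # Recompute the rollup so reads never have to scan the raw table.
    conn.execute("""
        INSERT OR REPLACE INTO daily_event_counts (day, n_events)
        SELECT date(created_at), COUNT(*)
        FROM events
        GROUP BY date(created_at)
    """)
    conn.commit()
    conn.close()

# Run from cron, e.g. nightly:
#   0 2 * * * /usr/bin/python3 refresh_counts.py
if __name__ == "__main__":
    refresh_daily_counts()
```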
The question is: at what point in the lifespan of the data should the actual statistical/NLP/etc. analysis occur?
Upvotes: 0
Views: 126
Reputation: 5082
The way you usually do this is: collect data, have it in some sort of database (SQL or NoSQL), and then for processing dump it into a Hadoop grid if it's a huge amount of data; otherwise do whatever you usually do. Then you have jobs analyzing that data and feeding the results back to you.
Get data -> Store it -> Dump it -> Analyze it -> Use results of offline analysis
Data crunching directly against the live database just doesn't work too well.
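As a rough illustration of that flow (not exact code for any particular stack): the `events` table, the local TSV dump, and the word-count "analysis" below are all placeholders, and at real scale the dump/analyze steps would be Hadoop jobs rather than one local script.

```python
# Sketch of "dump, then analyze offline, then feed results back",
# assuming a hypothetical SQLite `events(id, text)` table.
import sqlite3
from collections import Counter

def dump_events(db_path="app.db", out_path="events_dump.tsv"):
    # Step 1: one bulk read from the store, so analysis never hits it again.
    conn = sqlite3.connect(db_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for event_id, text in conn.execute("SELECT id, text FROM events"):
            out.write(f"{event_id}\t{text}\n")
    conn.close()
    return out_path

def analyze(dump_path):
    # Step 2: offline analysis against the dump (here, a trivial word count).
    counts = Counter()
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            _, text = line.rstrip("\n").split("\t", 1)
            counts.update(text.lower().split())
    return counts

def store_results(counts, db_path="app.db"):
    # Step 3: feed the results back to a table the application can read cheaply.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts (word TEXT PRIMARY KEY, n INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO word_counts (word, n) VALUES (?, ?)",
        counts.items(),
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    store_results(analyze(dump_events()))
```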
Upvotes: 1
Reputation: 16099
It depends what you have in mind when you say NLP. The moment a few tens of tweets/status updates are stored somewhere, you can start reading and analysing them. It's probably not a great idea to repeatedly query your only production server while the NLP is taking place; you might want to take a dump of whatever data there is and work from there.
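For example, something like the following is already enough to start poking at a small dump. The `status_updates.txt` file (one update per line) and the regex tokenizer are just placeholders for whatever NLP you actually want to run; the point is that you take the dump once and then iterate locally as often as you like.

```python
# Work from a one-off dump instead of hammering the production database.
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")

def load_dump(path="status_updates.txt"):
    # The dump is produced once (e.g. a single SELECT or mysqldump run).
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def term_frequencies(updates):
    # Toy "NLP": lowercase, tokenize, count terms across all updates.
    counts = Counter()
    for text in updates:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts

if __name__ == "__main__":
    for term, n in term_frequencies(load_dump()).most_common(20):
        print(term, n)
```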
Upvotes: 0