James Bo

Reputation: 88

Storing NLP corpora in databases rather than csv?

While implementing an NLP system, I wonder why CSV files are so often used to store text corpora in academia and in common Python examples (in particular, NLTK-based ones). I have personally run into issues using a system that generates a number of corpora automatically and accesses them later.

These are issues that come from CSV files:

- Difficult to automate backups
- Difficult to ensure availability
- Potential transaction races and thread-access issues
- Difficult to distribute/shard over multiple servers
- Schema not clear or defined if the corpora become complicated
- Accessing via a filename is risky; it could be altered
- File corruption is possible
- Fine-grained permissions are not typically used for file access

Issues from using MySQL or MongoDB:

- Initial setup: keeping a dedicated server running with the DB instance online
- Requires spending time creating and defining a schema

Pros of CSV:

- Theoretically easier to automate zipping and unzipping of contents
- More familiar to some programmers
- Easier to transfer to another academic researcher, via FTP or even e-mail

Looking at multiple academic articles, even for more advanced NLP research, for example Named Entity Recognition or statement extraction, the research seems to use CSV.

Are there other advantages to the CSV format, that make it so widely used? What should an Industry system use?

Upvotes: 0

Views: 1670

Answers (1)

0x5050

Reputation: 1231

I will organize the answer into two parts:

Why CSV:

A dataset for an NLP task, be it classification or sequence annotation, basically requires two things per training instance in a corpus:

  1. The text (which might be a single token, a sentence, or a document) to be annotated, and optionally pre-extracted features.
  2. The corresponding label/tag.

Because of this simple tabular organization of data, which is consistent across different NLP problems, CSV is a natural choice. CSV is easy to learn, easy to parse, easy to serialize, and can accommodate different encodings and languages. CSV is easy to work with in Python (the dominant language for NLP), and there are excellent libraries like Pandas that make it really easy to manipulate and re-organize the data.
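To make this concrete, here is a minimal sketch of the text-plus-label layout loaded with pandas. The corpus contents and column names (`text`, `label`) are made up for illustration; an in-memory string stands in for a file on disk:

```python
import io

import pandas as pd

# A hypothetical two-column sentiment corpus: one row per training
# instance, exactly the "text + label" shape described above.
csv_data = io.StringIO(
    'text,label\n'
    '"I loved this film",pos\n'
    '"Terrible plot and acting",neg\n'
)

# In practice this would be pd.read_csv("corpus.csv")
df = pd.read_csv(csv_data)

texts = df["text"].tolist()    # inputs to annotate / featurize
labels = df["label"].tolist()  # corresponding tags
print(df["label"].value_counts())  # quick class-distribution check
```

The same two lists (`texts`, `labels`) can then be fed straight into most scikit-learn or NLTK training APIs, which is a large part of why the format persists.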

Why not a database:

A database is really overkill. An NLP model is almost always trained offline, i.e. you fit all the data at once into an ML/DL model. There are no concurrency issues; the only parallelism that exists during training is managed inside a GPU. There is no security issue during training either: you train the model on your own machine and only deploy the trained model to a server.

Upvotes: 2
