Reputation: 3433
I am fairly new to pandas (a couple of months) and I am starting a project that will be built around a pandas DataFrame.
The DataFrame will be a table of the different words present in a collection of texts (around 100k docs and around 200 keywords).
Imagine, for instance, the words "car" and "motorbike", and documents numbered doc1, doc2, etc.
How should I arrange it? a) The column names are the doc numbers and the index holds the words "car" and "motorbike", or b) the other way around: the index holds the doc numbers and the column headers are the words?
I don't have enough insight into pandas to foresee the consequences of this choice, and all the code will be based on that decision.
As a side note, the table is not static: more documents and more words will be added to it every now and again.
What would you recommend, a or b, and why?
Thanks.
Upvotes: 0
Views: 50
Reputation: 386
Generally in pandas, the convention is that instances are rows (here, the doc numbers) and features are columns (here, the words). Each column stores a single dtype, so operations on one word across all documents stay fast. So prefer approach 'b'.
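As a rough sketch of layout 'b', assuming the table holds per-document keyword counts (the counts, the extra keyword "truck", and the variable names below are made up for illustration), it could look something like this:

```python
import pandas as pd

# Hypothetical keyword counts: one row per document, one column per keyword.
counts = pd.DataFrame(
    {"car": [3, 0, 1], "motorbike": [0, 2, 1]},
    index=["doc1", "doc2", "doc3"],
)

# Per-word statistics are plain column operations.
car_total = counts["car"].sum()

# Selecting documents is a boolean mask over the rows.
docs_with_both = counts[(counts["car"] > 0) & (counts["motorbike"] > 0)]

# A new keyword is just a new column (illustrative values).
counts["truck"] = [1, 0, 0]

# New documents are new rows; collect a batch and concatenate once
# instead of appending one row at a time.
new_docs = pd.DataFrame(
    {"car": [2], "motorbike": [0], "truck": [4]},
    index=["doc4"],
)
counts = pd.concat([counts, new_docs])
```

Since documents will probably arrive far more often than new keywords, batching new rows into a single `pd.concat` call tends to be cheaper than growing the table row by row.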
Upvotes: 1