MAS

Reputation: 4993

Append a column to a two-dimensional variable

I have a two-dimensional variable, but I don't know whether it is a list or an array. Think of it as a matrix of size n by m. I want to append a column of size n by 1 to it, so that the new variable is n by m+1. This is how I am doing it:

train_data_features.append(train['NewsDesk'])

This is the error I am getting:

train_data_features.append(train['NewsDesk'])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/base.py", line 440, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: append not found

And this is my whole code:

import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from KaggleWord2VecUtility import KaggleWord2VecUtility
import pandas as pd
import numpy as np

if __name__ == '__main__':
    train = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'NYTimesBlogTrain.csv'), header=0)
    test = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'NYTimesBlogTest.csv'), header=0)
    train["Headline"].fillna(0)
    print 'A sample headline is:'
    print train["Headline"][0:10]
    #raw_input("Press Enter to continue...")


    #print 'Download text data sets. If you already have NLTK datasets downloaded, just close the Python download window...'
    #nltk.download()  # Download text data sets, including stop words

    # Initialize an empty list to hold the clean reviews
    clean_train_reviews = []
    # Loop over each review; create an index i that goes from 0 to the length
    # of the movie review list
    print "Cleaning and parsing the training set headlines...\n"
    for i in xrange( 0, len(train["Headline"])):
    #for i in xrange( 0, 10):
        if pd.isnull(train["Headline"][i])==False:
            clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["Headline"][i], True)))
        else:
            clean_train_reviews.append(" ")
    print 'clean train reviews (headlines)'
    print clean_train_reviews  

    # ****** Create a bag of words from the training set
    #
    print "Creating the bag of words...\n"


    # Initialize the "CountVectorizer" object, which is scikit-learn's
    # bag of words tool.
    vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000)

    # fit_transform() does two functions: First, it fits the model
    # and learns the vocabulary; second, it transforms our training data
    # into feature vectors. The input to fit_transform should be a list of
    # strings.

    train_data_features = vectorizer.fit_transform(clean_train_reviews)
    print 'train_data_features'
    print train_data_features
    print 'train_data_features.shape'
    print train_data_features.shape
    # Take a look at the words in the vocabulary
    vocab = vectorizer.get_feature_names()
    print 'vocab'
    print vocab

    # Sum up the counts of each vocabulary word
    #dist = np.sum(train_data_features, axis=0)
    dist = train_data_features.sum (axis=0)
    print 'dist'
    print dist
    # For each, print the vocabulary word and the number of times it 
    # appears in the training set
    print 'tag+count'
    for tag, count in zip(vocab, dist):
        print count, tag
        print 'and'

#    for i in xrange( 0, len(train["NewsDesk"])):    
    for i in xrange( 0, 10):    
        if pd.isnull(train["NewsDesk"][i])==False:
            print train['NewsDesk'][i]
        else:
            print '   '

    train_data_features.append(train['NewsDesk'])

Upvotes: 0

Views: 195

Answers (1)

hpaulj

Reputation: 231385

Sparse matrices don't have an append method, but scipy.sparse does provide vstack and hstack. I'll illustrate with a simple matrix:

In [120]: import numpy as np
In [121]: from scipy import sparse
In [122]: M = sparse.csr_matrix([[0,1,0],[1,0,1]])

In [123]: M.A   # show as array
Out[123]: 
array([[0, 1, 0],
       [1, 0, 1]], dtype=int32)

In [124]: M.todense()  # show a numpy matrix
Out[124]: 
matrix([[0, 1, 0],
        [1, 0, 1]], dtype=int32)

In [125]: col=np.array([[2],[3]])  # a simple column array
In [126]: col
Out[126]: 
array([[2],
       [3]])

In [128]: sparse.hstack([M,col])
Out[128]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in COOrdinate format>

In [129]: sparse.hstack([M,col]).A
Out[129]: 
array([[0, 1, 0, 2],
       [1, 0, 1, 3]], dtype=int32)

In [130]: sparse.vstack([M,[1,2,3]]).A   # or add a row
Out[130]: 
array([[0, 1, 0],
       [1, 0, 1],
       [1, 2, 3]], dtype=int32)
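
For the matrix in the question, a sketch along the same lines might look like the following. The data are made-up stand-ins (`features` for the `CountVectorizer` output, `news_desk` for `train['NewsDesk']`), and turning the text labels into integers with `pd.factorize` is only an assumption about one way to make the column numeric; the question does not say how `NewsDesk` should be encoded.

```python
import pandas as pd
from scipy import sparse

# Stand-in for the CountVectorizer output: a 3 x 4 sparse count matrix.
features = sparse.csr_matrix([[1, 0, 0, 2],
                              [0, 1, 0, 0],
                              [0, 0, 3, 0]])

# Stand-in for train['NewsDesk']: a text column with one missing value.
news_desk = pd.Series(["Business", None, "Culture"])

# hstack needs numbers, so map each label to an integer code.
# pd.factorize marks missing values with -1.
codes, labels = pd.factorize(news_desk)

# Reshape to an (n, 1) column so the row counts match, and make it sparse.
col = sparse.csr_matrix(codes.reshape(-1, 1))

# hstack returns a new matrix; format="csr" keeps the result in CSR form.
combined = sparse.hstack([features, col], format="csr")

print(combined.shape)
print(combined.toarray())
```

Integer codes impose an arbitrary ordering on the categories, so a one-hot encoding of the column may suit a classifier better; a one-hot block could be stacked onto the features with the same `hstack` call.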

numpy's append is just a fancy wrapper for np.concatenate; vstack and hstack are simpler wrappers. Also, unlike the list append, np.append does not change the array in place; it returns a new array. It is best to just avoid it and think in terms of concatenate instead.
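
A small sketch of that difference, using dense arrays with the same values as the example above (the variable names are just for illustration):

```python
import numpy as np

a = np.array([[0, 1, 0],
              [1, 0, 1]])
col = np.array([[2], [3]])

# np.append returns a new array; `a` itself is left unchanged.
b = np.append(a, col, axis=1)
print(a.shape, b.shape)

# The same result written directly with concatenate and hstack.
c = np.concatenate([a, col], axis=1)
d = np.hstack([a, col])
print(np.array_equal(b, c), np.array_equal(b, d))

# Without an axis argument, np.append flattens both inputs first.
flat = np.append(a, col)
print(flat.shape)

# list.append, by contrast, modifies the list in place and returns None.
rows = [[0, 1, 0]]
result = rows.append([1, 0, 1])
print(result, len(rows))
```

Writing the call as `np.concatenate` makes the axis explicit and avoids the surprise of the flattened result when `axis` is left out.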

Upvotes: 1
