Dbercules

Reputation: 719

MemoryError when applying TfidfVectorizer's 'fit_transform()' to a pandas DataFrame column of strings

I'm attempting an operation similar to the one shown here. I begin by reading in two columns from a CSV file containing 2405 rows, in the format: Year, e.g. "1995", and cleaned, e.g. ["this", "is", "exemplar", "document", "contents"]; both columns hold strings.

    import pandas

    df = pandas.read_csv("ukgovClean.csv", encoding='utf-8', usecols=[0,2])

I have already pre-cleaned the data, and below is the format of the first few rows:

     [IN] df.head()

    [OUT]   Year    cleaned
         0  1909    acquaint hous receiv follow letter clerk crown...
         1  1909    ask secretari state war whether issu statement...
         2  1909    i beg present petit sign upward motor car driv...
         3  1909    i desir ask secretari state war second lieuten...
         4  1909    ask secretari state war whether would introduc...

    [IN] df['cleaned'].head()

    [OUT] 0    acquaint hous receiv follow letter clerk crown...
          1    ask secretari state war whether issu statement...
          2    i beg present petit sign upward motor car driv...
          3    i desir ask secretari state war second lieuten...
          4    ask secretari state war whether would introduc...
          Name: cleaned, dtype: object

Then I initialise the TfidfVectorizer:

    [IN] from sklearn.feature_extraction.text import TfidfVectorizer
    [IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8')

Following this, calling upon the below line results in:

    [IN] x = v.fit_transform(df['cleaned'])
    [OUT] ValueError: np.nan is an invalid document, expected byte or unicode string.

I overcame this using the solution in the aforementioned thread:

    [IN] x = v.fit_transform(df['cleaned'].values.astype('U'))

However, this resulted in a MemoryError (Full Traceback).

I've looked into using pickle for storage to work around the heavy memory usage, but I'm not sure how it would fit into this scenario; the rough sketch below shows the kind of caching I had in mind. Any tips would be much appreciated, and thanks for reading.
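This is only a sketch, assuming fit_transform() succeeds at least once; the file name tfidf_cache.pkl is a placeholder, and caching would only spare repeated runs, not the initial fit:

    import pickle

    # Assuming v and x exist from a successful fit_transform() above, cache
    # both the fitted vectorizer and the sparse matrix so later sessions can
    # load them instead of re-fitting.
    with open('tfidf_cache.pkl', 'wb') as f:    # placeholder file name
        pickle.dump((v, x), f)

    # In a later session:
    with open('tfidf_cache.pkl', 'rb') as f:
        v, x = pickle.load(f)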

[UPDATE]

@pittsburgh137 posted a solution to a similar problem involving fitting data here, in which the training data is generated using pandas.get_dummies(). What I've done with this is:

    [IN] train_X = pandas.get_dummies(df['cleaned'])
    [IN] train_X.shape
    [OUT] (2405, 2380)

    [IN] x = v.fit_transform(train_X)
    [IN] type(x)
    [OUT] scipy.sparse.csr.csr_matrix

I thought I should update readers while I see what I can do with this development. If there are any foreseeable pitfalls with this method, I'd love to hear them.

Upvotes: 1

Views: 2673

Answers (1)

Brad Solomon

Reputation: 40888

I believe it's the conversion to dtype('<Unn') that might be giving you trouble: '<Unn' is NumPy's fixed-width Unicode dtype, so every element is padded out to the length of the longest string. Check out the size of the array on a relative basis, using just the first few documents plus a NaN:

    >>> df['cleaned'].values
    array(['acquaint hous receiv follow letter clerk crown',
           'ask secretari state war whether issu statement',
           'i beg present petit sign upward motor car driv',
           'i desir ask secretari state war second lieuten',
           'ask secretari state war whether would introduc', nan],
          dtype=object)

    >>> df['cleaned'].values.astype('U').nbytes
    1104

    >>> df['cleaned'].values.nbytes
    48

It seems like it would make sense to drop the NaN values first (df.dropna(inplace=True)). Then, it should be pretty efficient to call v.fit_transform(df['cleaned'].tolist()).
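Putting that together, here's a rough sketch reusing the read_csv call and vectorizer settings from your question:

    import pandas
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pandas.read_csv("ukgovClean.csv", encoding='utf-8', usecols=[0,2])
    df.dropna(inplace=True)    # drop rows containing NaN before vectorizing

    v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
    # A plain list of str avoids building a fixed-width '<Unn' array at all.
    x = v.fit_transform(df['cleaned'].tolist())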

Upvotes: 1
