Kobe-Wan Kenobi

Reputation: 3864

Incremental PCA

I've never used the incremental PCA that exists in sklearn, and I'm a bit confused about its parameters; I haven't been able to find a good explanation of them.

I see that there is batch_size in the constructor, but when using the partial_fit method you again pass only a part of your data at a time. I've come up with the following approach:

from sklearn.decomposition import IncrementalPCA

n = df.shape[0]
chunk_size = 100000
iterations = n // chunk_size

ipca = IncrementalPCA(n_components=40, batch_size=1000)

for i in range(iterations):
    ipca.partial_fit(df[i*chunk_size : (i+1)*chunk_size].values)

# remaining rows, if n is not an exact multiple of chunk_size
if n % chunk_size:
    ipca.partial_fit(df[iterations*chunk_size : n].values)

Now, what I don't understand is this: when using partial_fit, does batch_size play any role at all, or not? And how are the two related?

Moreover, if both matter, how should I change their values to increase precision at the cost of a larger memory footprint (and, the other way around, to decrease memory consumption at the price of lower accuracy)?
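For context, after fitting I also transform the data in chunks; this is just my own sketch reusing the variables above (not something taken from the docs), in case it matters for the answer:

import numpy as np

# project the data chunk by chunk as well, so the reduced matrix is built incrementally
transformed_chunks = []
for i in range(iterations + 1):
    start, stop = i * chunk_size, min((i + 1) * chunk_size, n)
    if start >= stop:
        break
    transformed_chunks.append(ipca.transform(df[start:stop].values))

X_reduced = np.vstack(transformed_chunks)   # shape (n, 40)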

Upvotes: 2

Views: 1771

Answers (2)

BBSysDyn

Reputation: 4601

Here is some incremental PCA code based on https://github.com/kevinhughes27/pyIPCA, which is an implementation of the CCIPCA method.

import numpy as np
from scipy import linalg as la

class CCIPCA:
    """Candid covariance-free incremental PCA (CCIPCA)."""
    def __init__(self, n_components, n_features, amnesic=2.0, copy=True):
        self.n_components = n_components
        self.n_features = n_features
        self.copy = copy
        self.amnesic = amnesic
        self.iteration = 0
        # running mean of the samples seen so far
        self.mean_ = np.zeros([self.n_features], dtype=float)
        # components start as small equal values and are refined one sample at a time
        self.components_ = np.ones((self.n_components, self.n_features)) / \
                           (self.n_features * self.n_components)

    def partial_fit(self, u):
        n = float(self.iteration)
        V = self.components_

        # amnesic learning params
        if n <= int(self.amnesic):
            w1 = float(n+2-1)/float(n+2)    
            w2 = float(1)/float(n+2)    
        else:
            w1 = float(n+2-self.amnesic)/float(n+2)    
            w2 = float(1+self.amnesic)/float(n+2)

        # update mean
        self.mean_ = w1*self.mean_ + w2*u

        # mean center u        
        u = u - self.mean_

        # update components
        for j in range(0,self.n_components):

            if j > n: pass            
            elif j == n: V[j,:] = u
            else:       
                # update the components
                V[j,:] = w1*V[j,:] + w2*np.dot(u,V[j,:])*u / la.norm(V[j,:])
                normedV = V[j,:] / la.norm(V[j,:])
                normedV = normedV.reshape((self.n_features, 1))
                u = u - np.dot(np.dot(u,normedV),normedV.T)

        self.iteration += 1
        self.components_ = V / la.norm(V)

        return

    def post_process(self):        
        self.explained_variance_ratio_ = np.sqrt(np.sum(self.components_**2,axis=1))
        idx = np.argsort(-self.explained_variance_ratio_)
        self.explained_variance_ratio_ = self.explained_variance_ratio_[idx]
        self.components_ = self.components_[idx,:]
        self.explained_variance_ratio_ = (self.explained_variance_ratio_ / \
                                          self.explained_variance_ratio_.sum())
        for r in range(0,self.components_.shape[0]):
            d = np.sqrt(np.dot(self.components_[r,:],self.components_[r,:]))
            self.components_[r,:] /= d

You can test it with

import numpy as np
import pandas as pd
import ccipca

df = pd.read_csv('iris.csv')
df = np.array(df)[:, :4].astype(float)
pca = ccipca.CCIPCA(n_components=2, n_features=4)
print(df[0, :])
for i in range(df.shape[0]):
    pca.partial_fit(df[i, :])
pca.post_process()

The resulting eigenvectors / eigenvalues will not be exactly the same as those from batch PCA. The results are approximate, but they are useful.
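If you want to see how close it gets, a rough sanity check (just a sketch, reusing the df and pca objects from the snippet above) is to compare the directions against sklearn's batch PCA:

import numpy as np
from sklearn.decomposition import PCA

batch_pca = PCA(n_components=2).fit(df)

# components can differ in sign and scale, so compare absolute cosine similarity
for k in range(2):
    v_inc = pca.components_[k] / np.linalg.norm(pca.components_[k])
    v_bat = batch_pca.components_[k] / np.linalg.norm(batch_pca.components_[k])
    print(abs(np.dot(v_inc, v_bat)))   # values near 1.0 mean the directions roughly agree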

Upvotes: 0

sascha

Reputation: 33532

The docs say:

batch_size : int or None, (default=None)

The number of samples to use for each batch. Only used when calling fit...

This parameter is not used within partial_fit, where the batch size is simply whatever you pass in each call.

Bigger batches will increase memory-consumption, smaller ones will decrease it. This is also written in the docs:

This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory.
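As a rough illustration of that memmap use case (the file name, dtype and shape below are made up), something like this keeps only about batch_size rows in memory at a time:

import numpy as np
from sklearn.decomposition import IncrementalPCA

# hypothetical on-disk array: 1,000,000 samples with 200 features, stored as float64
X = np.memmap('X.dat', dtype=np.float64, mode='r', shape=(1000000, 200))

# fit() streams over the memmap in slices of batch_size rows,
# so peak memory scales with batch_size rather than with the full dataset
ipca = IncrementalPCA(n_components=40, batch_size=10000)
ipca.fit(X)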

Apart from some checks and parameter heuristics, the whole fit function boils down to this:

for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch], check_input=False)
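In other words (a sketch, not checked against every sklearn version): calling fit with a given batch_size should do essentially the same work as looping partial_fit yourself over chunks of that size:

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.rand(10000, 50)

# let fit() do the chunking via batch_size ...
ipca_fit = IncrementalPCA(n_components=10, batch_size=1000).fit(X)

# ... or chunk manually with partial_fit (batch_size is ignored on this path)
ipca_manual = IncrementalPCA(n_components=10)
for start in range(0, X.shape[0], 1000):
    ipca_manual.partial_fit(X[start:start + 1000])

# the learned components should agree up to sign and floating-point noise
print(np.allclose(np.abs(ipca_fit.components_), np.abs(ipca_manual.components_)))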

Upvotes: 3
