Reputation: 503
I have an input file which contains floating-point numbers to 4 decimal places, e.g.:
13359 0.0000 0.0000 0.0001 0.0001 0.0002 0.0003 0.0007 ...
(the first field is the id).
My class uses the loadVectorsFromFile method, which multiplies each value by 10000 and then int()s these numbers. On top of that, I also loop through each vector to ensure that there are no negative values inside. However, when I perform _hclustering, I continually see the error "Linkage Z contains negative values".
I seriously think this is a bug. Can someone enlighten me as to why I am seeing this weird error? What is going on that is causing this negative-distance error?
=====
# module-level requirements (elsewhere in the class's module):
#   import os, operator
#   from os.path import join as pjoin
#   from collections import defaultdict
#   from scipy.cluster import hierarchy
def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True):
    """Inflate to prevent "negative" distance; we use 4 decimal points, so *10000"""
    vectors = {}
    self.winfo("Each vector is set to have %d limit in length" % limit)
    with open(loc) as inf:
        for line in filter(None, inf.read().split('\n')):
            l = line.split('\t')
            if limit:
                scores = map(float, l[1:limit + 1])
            else:
                scores = map(float, l[1:])
            if inflate:
                vectors[l[0]] = map(lambda x: int(x * 10000), scores)  # int might save space
            else:
                vectors[l[0]] = scores

    if assertAllPositive:
        # Assert that no vector contains a negative value
        for dirID, l in vectors.iteritems():
            if reduce(operator.or_, map(lambda x: x < 0, l)):
                self.werror("Vector %s has negative values!" % dirID)
    return vectors
def main(self, inputDir, outputDir, limit=0,
         inFname="data.vectors.all", mappingFname='all.id.features.group.intermediate'):
    """
    Loads vectors from a file and starts clustering

    INPUT
        vectors is { featureID: tfidfVector (list), }
    """
    IDFeatureDic = loadIdFeatureGroupDicFromIntermediate(pjoin(self.configDir, mappingFname))
    if not os.path.exists(outputDir):
        os.makedirs(outputDir)

    vectors = self.loadVectorsFromFile(limit, pjoin(inputDir, inFname))
    for threshold in map(lambda x: float(x) / 30, range(20, 30)):
        clusters = self._hclustering(threshold, vectors)
        if clusters:
            outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold))
            with open(outputLoc, 'w') as outf:
                for clusterNo, cluster in clusters.iteritems():
                    outf.write('%s\n' % str(clusterNo))
                    for featureID in cluster:
                        feature, group = IDFeatureDic[featureID]
                        outline = "%s\t%s\n" % (feature, group)
                        outf.write(outline.encode('utf-8'))
                    outf.write("\n")
        else:
            continue
def _hclustering(self, threshold, vectors):
    """function which you should call to vary the threshold
    vectors: { featureID: [ tfidf score, tfidf score, ... ] }
    """
    clusters = defaultdict(list)
    if len(vectors) > 1:
        try:
            results = hierarchy.fclusterdata(vectors.values(), threshold, metric='cosine')
        except ValueError, e:
            self.werror("_hclustering: %s" % str(e))
            return False
        for i, featureID in enumerate(vectors.keys()):
            clusters[results[i]].append(featureID)  # group feature IDs by cluster label
    return clusters
Upvotes: 11
Views: 7849
Reputation: 11
I had the same issue. What you can do is rewrite the cosine function. For example:
from sklearn.metrics.pairwise import cosine_similarity

def mycosine(x1, x2):
    x1 = x1.reshape(1, -1)
    x2 = x2.reshape(1, -1)
    ans = 1 - cosine_similarity(x1, x2)
    return max(ans[0][0], 0)

...

clusters = hierarchy.fclusterdata(data, threshold, criterion='distance', metric=mycosine, method='average')
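For those without scikit-learn handy, the same clamping idea can be sketched in plain NumPy (clamped_cosine is a hypothetical name, not part of the answer above):

```python
import numpy as np

def clamped_cosine(x1, x2):
    # cosine distance computed directly, clamped at zero so tiny
    # negative round-off values can never reach the linkage step
    d = 1.0 - np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
    return max(d, 0.0)

a = np.array([0.12, 0.34, 0.56])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 1.0, 0.0])

same = clamped_cosine(a, a)   # identical vectors: near 0, never below it
perp = clamped_cosine(b, c)   # orthogonal vectors: distance 1.0
```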
Upvotes: 1
Reputation: 2541
"Linkage Z contains negative values". This error also occurs in the SciPy hierarchical clustering process when any cluster index in the linkage matrix is assigned -1.
As per my observations, a cluster index gets assigned -1 during the merge step when the distance between every pair of clusters or points to combine comes out as minus infinity. The linkage function still combines the clusters even though the linkage distance between them is -infinity, and assigns one of the clusters or points a negative index.
Summary: if you are using cosine distance as the metric and the norm (magnitude) of any data point is zero, then this error will occur.
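The zero-norm failure mode is easy to reproduce (a minimal NumPy sketch, not the original poster's pipeline):

```python
import numpy as np

def cosine_distance(u, v):
    # plain cosine distance with no guard against zero-length vectors
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

z = np.zeros(3)                   # a data point whose norm is zero
u = np.array([1.0, 2.0, 3.0])

with np.errstate(invalid='ignore', divide='ignore'):
    d = cosine_distance(z, u)     # 0/0 -> NaN, not a valid distance
```

Any NaN or infinite entry produced this way then poisons the linkage computation downstream.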
Upvotes: 1
Reputation: 2123
This is because of floating-point inaccuracy: some distances between your vectors, instead of being 0, are for example -0.000000000000000002. Use the numpy.clip() function to correct the problem. If your distance matrix is dmatr, use numpy.clip(dmatr, 0, 1, dmatr) and you should be OK.
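A minimal sketch of that fix (the dmatr values here are made up to stand in for round-off noise):

```python
import numpy as np

# hypothetical condensed distance matrix: one entry has dipped
# just below zero due to floating-point round-off
dmatr = np.array([0.0, 5e-17, -2e-17, 0.3])

np.clip(dmatr, 0, 1, dmatr)   # clip in place into [0, 1]
```

Passing dmatr as the third (out) argument makes the clip in place, so no copy of a potentially large distance matrix is made.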
Upvotes: 9
Reputation: 23970
I'm not able to improve on Justin's answer, but another point of note is your data handling.
You say you do something like int(float("0.0003") * 10000) to read the data. But if you do that you get not 3 but 2.9999999999999996, because the floating-point inaccuracy gets multiplied along with the value.
A better, or at least more accurate, way would be to do the multiplication on the string. That is, using string manipulation to get from 0.0003 to 3.0 and so forth.
Perhaps there is even a Python data type extension somewhere which can read in this kind of data without loss of precision, on which you can perform the multiplication before conversion. I'm not at home in SciPy/numerics, so I don't know.
EDIT
Justin commented that there is a decimal type built into Python. It can parse strings, multiply by integers, and convert to float (I tested that). That being the case, I would recommend updating your logic like:
import decimal

factor = 1
if inflate:
    factor = 10000
scores = map(lambda x: float(decimal.Decimal(x) * factor), l[1:])
That would at least reduce your rounding problems a bit.
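A quick check of the difference between the two routes (a standalone sketch; inflate and l from the question's code are replaced by literals):

```python
from decimal import Decimal

# naive route: the string becomes a binary float first, so the
# representation error is multiplied along with the value
naive = int(float("0.0003") * 10000)       # truncates 2.9999999999999996 down to 2

# Decimal route: the string is parsed exactly, so the product is exact
exact = float(Decimal("0.0003") * 10000)   # exactly 3.0
```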
Upvotes: 0
Reputation: 47082
I'm pretty sure that this is because you are using the cosine metric when you are calling fclusterdata. Try using euclidean and see if the error goes away.
The cosine metric can go negative if the dot product of two vectors in your set is greater than 1. Since you are using very large numbers and normalizing them, I'm pretty sure that the dot products are greater than 1 a lot of the time in your data set. If you want to use the cosine metric, then you'll need to normalize your data such that the dot product of two vectors is never greater than 1. See the formula on this page to see what the cosine metric is defined as in Scipy.
Edit:
Well, from looking at the source code, I think that the formula listed on that page isn't actually the formula that Scipy uses (which is good, because the source code looks like it is using the normal and correct cosine distance formula). However, by the time it gets to the linkage, there are clearly some negative values in the linkage for whatever reason. Try finding the distances between your vectors with scipy.spatial.distance.pdist() with metric='cosine' and check for negative values. If there aren't any, then it has to do with how the linkage is formed from the distance values.
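That check might look like this (a sketch with toy data standing in for vectors.values(); assumes SciPy is installed):

```python
import numpy as np
from scipy.spatial.distance import pdist

# toy stand-in for the real feature vectors: rows are data points;
# rows 0 and 1 are parallel, so their cosine distance should be ~0,
# which is exactly where round-off could push a value below zero
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 1.0, 0.5],
])

d = pdist(X, metric='cosine')   # condensed pairwise cosine distances
negatives = d[d < 0]            # any entries that would break the linkage
```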
Upvotes: 5