Kevin Thompson

Reputation: 2506

How to troubleshoot pandas scikit-learn multidimensional scaling runs forever

EDIT: It appears that the problem isn't necessarily the data in row 64. Rather, the number 64 itself seems to be magical. While continuing to troubleshoot, I wrote a script that randomly grabs 63 contiguous rows from the DataFrame and plots them. It runs quickly every time. If I change it to 64 rows, it never finishes and runs forever. End edit
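A minimal sketch of the sampling step in that script, not the exact code I ran. The helper name `random_window` is made up for illustration, and it only picks the row bounds; the slice itself would be taken from the DataFrame with `.iloc`.

```python
import random


def random_window(n_rows, size, rng=random):
    """Return (start, stop) bounds of a random contiguous block of `size` rows.

    `random_window` is a hypothetical helper, not part of pandas.
    """
    if not 0 < size <= n_rows:
        raise ValueError("size must be between 1 and n_rows")
    # The last valid start index is n_rows - size, so randrange excludes
    # n_rows - size + 1.
    start = rng.randrange(n_rows - size + 1)
    return start, start + size


# With a DataFrame named `plottable`, the sample would be taken like this:
#   start, stop = random_window(len(plottable), 63)
#   sample = plottable.iloc[start:stop]
```

Changing 63 to 64 in the call is the only difference between the runs that finish and the runs that hang.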

I'm trying to visualize data in clusters using multidimensional scaling. I've created a DataFrame that has 1000 rows and 1964 columns. When I try to perform multidimensional scaling on the data, the process runs forever. Curiously, I can't even end the process with Ctrl+C.

Through a process of trial and error I have discovered something magical about the 64th row of the dataset. If I run the process on 63 rows the whole thing is done in a couple seconds. If I bump that up to 64 rows, though, it will never end.
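The trial and error above could be automated along these lines. This is a rough sketch under assumptions, not something known to fix the problem: it runs each attempt in a separate interpreter so a hung run can be killed after a timeout (needed here because Ctrl+C has no effect), it uses random data as a stand-in for the CSV, and it assumes Python 3 for `subprocess.run`. The names `run_with_timeout`, `MDS_SNIPPET` and `sweep` are made up for illustration.

```python
import subprocess
import sys
import time


def run_with_timeout(code, timeout_s):
    """Run `code` in a fresh Python interpreter.

    Returns (finished, seconds). `finished` is False when the child was
    killed for exceeding `timeout_s`. If the child exits with an error,
    subprocess.CalledProcessError propagates to the caller.
    """
    start = time.perf_counter()
    try:
        subprocess.run([sys.executable, "-c", code],
                       timeout=timeout_s, check=True)
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.perf_counter() - start


# Same MDS call as in the script below, but on synthetic data (an assumption)
# so that the row count is the only thing that varies.
MDS_SNIPPET = """
import numpy as np
from sklearn import manifold
from sklearn.metrics.pairwise import euclidean_distances
X = np.random.RandomState(0).rand({rows}, 1964)
euc = euclidean_distances(X)
mds = manifold.MDS(n_jobs={n_jobs}, random_state=1337,
                   dissimilarity='precomputed')
mds.fit(euc)
"""


def sweep(row_counts, n_jobs, timeout_s=60):
    """Map each row count to (finished, seconds) for one MDS run."""
    return {
        rows: run_with_timeout(
            MDS_SNIPPET.format(rows=rows, n_jobs=n_jobs), timeout_s)
        for rows in row_counts
    }


# Example: compare row counts around 64, once with n_jobs=1 and once with -1.
#   print(sweep(range(62, 67), n_jobs=1))
#   print(sweep(range(62, 67), n_jobs=-1))
```

If the synthetic data hangs at 64 rows as well, the contents of row 64 can be ruled out. Whether `n_jobs` makes any difference is just one more variable to isolate, not a known cause.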

I'm really at a loss as to how I should even approach troubleshooting this. I went through the 1964 columns looking for differences between row 63 and row 64 hoping to find a strange value or something but nothing jumped out at me. Any other way I can get an idea of why 64 rows is so magical?

import pandas as pd
from pandas import DataFrame as df
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import manhattan_distances
from sklearn import manifold
from matplotlib import pyplot as plt
import prettyplotlib as ppl

malware_df = df.from_csv('malware_features.csv')

plottable = malware_df[[c for c in malware_df.columns if c != 'hash']]
plottable = plottable.head(63) # change this to 64 and everything stops working

euc = euclidean_distances(plottable)
mds = manifold.MDS(n_jobs=-1, random_state=1337, dissimilarity='precomputed')
pos_e = mds.fit(euc).embedding_

plottable['xpos'] = pos_e[:,0]
plottable['ypos'] = pos_e[:,1]
with ppl.pretty:
    fig, ax = ppl.subplots(figsize=(6,8))
ppl.scatter(ax, plottable.xpos, plottable.ypos)
plt.show()

Here is a link where you can download the file I'm using, if that helps: https://drive.google.com/file/d/0BxZZOOgLl7vSTUlxc1BmMUFmTVU/edit?usp=sharing

Upvotes: 0

Views: 690

Answers (1)

xbello

Reputation: 7443

This has to be something to do with versions. On my computer (2003, 1 AMD core, 2 GB RAM), this code runs in ~3 seconds:

#import pandas as pd
from pandas import DataFrame as df
from sklearn.metrics.pairwise import euclidean_distances
#from sklearn.metrics.pairwise import manhattan_distances
from sklearn import manifold
from matplotlib import pyplot as plt 
import prettyplotlib as ppl 

malware_df = df.from_csv('malware_features.csv')

plottable = malware_df[[c for c in malware_df.columns if c != 'hash']]
plottable = plottable.head(128) # 64 did not hang here, so I raised it to 128

euc = euclidean_distances(plottable)
mds = manifold.MDS(n_jobs=-1, random_state=1337, dissimilarity='precomputed')
pos_e = mds.fit(euc).embedding_

plottable['xpos'] = pos_e[:,0]
plottable['ypos'] = pos_e[:,1]

fig, ax = ppl.subplots(figsize=(6,8))

ppl.scatter(ax, plottable.xpos, plottable.ypos)
plt.show()

To produce this graphic:

(image: scatter plot of the MDS embedding)

Note that I tried 128 after 64 did not fail, to see what would happen. I also removed the with ppl.pretty block, which raises an error, and everything ran fine. This is my pip freeze:

brewer2mpl==1.4
matplotlib==1.4.0
numpy==1.9.0
pandas==0.14.1
prettyplotlib==0.1.7
reportlab==3.1.8
scikit-learn==0.15.2
scipy==0.14.0

And python 2.7.3.
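To compare your environment against the list above without leaving Python, something like the following may help. It is only a sketch: the helper name `installed_versions` is made up, it relies on each package exposing a `__version__` attribute (common but not guaranteed), and scikit-learn is imported under the name `sklearn`.

```python
import importlib
import platform


def installed_versions(names):
    """Map each importable module name to its __version__ string.

    Modules that fail to import map to None; modules without a
    __version__ attribute map to "unknown".
    """
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:  # a broken install can raise more than ImportError
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Import names for the packages used in the script above.
PACKAGES = ["numpy", "scipy", "sklearn", "pandas", "matplotlib",
            "prettyplotlib"]

# Example:
#   print("python", platform.python_version())
#   for name, version in installed_versions(PACKAGES).items():
#       print(name, version)
```

Any package whose version differs from mine is a candidate to upgrade or downgrade first.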

Upvotes: 1
