Reputation: 8002
I have a CSV file with x, y, and z columns that represent coordinates in 3-dimensional space. I need to create a distance matrix giving the distance from each item to every other item.
I can easily read the CSV with pandas' read_csv function, resulting in a DataFrame like the following:
import pandas as pd
import numpy as np
samples = pd.DataFrame(
    columns=['source', 'name', 'x', 'y', 'z'],
    data=[['a', 'apple', 1.0, 2.0, 3.0],
          ['b', 'pear', 2.0, 3.0, 4.0],
          ['c', 'tomato', 9.0, 8.0, 7.0],
          ['d', 'sandwich', 6.0, 5.0, 4.0]]
)
I can then convert the separate x, y, z columns into a Series of tuples:
samples['coord'] = samples.apply(
    lambda row: (row['x'], row['y'], row['z']),
    axis=1
)
or a Series of lists:
samples['coord'] = samples.apply(
    lambda row: [row['x'], row['y'], row['z']],
    axis=1
)
But I cannot create a Series of arrays:
samples['coord'] = samples.apply(
    lambda row: np.array([row['x'], row['y'], row['z']]),
    axis=1
)
I get a ValueError: "Shape of passed values is (4,3), indices imply (4,6)".
I'd really like to have the data prepped so that I can simply call scipy's distance_matrix function, which expects two arrays, as follows:
import scipy.spatial
dmat = scipy.spatial.distance_matrix(
    samples['coord'].values,
    samples['coord'].values
)
I am, of course, open to any more pythonic or more efficient way to achieve this goal if my approach is poor.
Upvotes: 0
Views: 852
Reputation: 85442
This stores a NumPy array in each row of the coord column:
samples['coord'] = list(samples[['x', 'y', 'z']].values)
Now:
>>> samples.coord[0]
array([ 1., 2., 3.])
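Note that a column built this way has object dtype (each cell holds its own small array), so its values are a 1-D array of arrays rather than a 2-D float array. As a rough sketch of one way to get from the embedded arrays back to something distance_matrix accepts, you could stack them first. The variable names coords and dmat are just illustrative, and to_numpy() is used here in place of .values on the assumption that a reasonably recent pandas is installed:

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

samples = pd.DataFrame(
    columns=['source', 'name', 'x', 'y', 'z'],
    data=[['a', 'apple', 1.0, 2.0, 3.0],
          ['b', 'pear', 2.0, 3.0, 4.0],
          ['c', 'tomato', 9.0, 8.0, 7.0],
          ['d', 'sandwich', 6.0, 5.0, 4.0]]
)

# One 3-element array per row, stored in an object-dtype column.
samples['coord'] = list(samples[['x', 'y', 'z']].to_numpy())

# Rebuild a regular 2-D float array (one row per sample) from the
# per-row arrays before handing it to SciPy.
coords = np.stack(samples['coord'].tolist())

# Pairwise Euclidean distances between all samples.
dmat = distance_matrix(coords, coords)
```

The stacking step is the price of keeping the arrays inside the DataFrame; if you only need the matrix, slicing the x, y, z columns directly avoids it.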
Upvotes: 3
Reputation: 8002
I figured out that I can just extract a numpy array from the dataframe and use it to get the distance matrix.
import scipy.spatial
sample_array = np.array(samples[['x', 'y', 'z']])
dmat = scipy.spatial.distance_matrix(sample_array, sample_array)
But I'd still like to have those little arrays embedded in the dataframe, alongside the other data, and I'd upvote and accept an answer that can do that.
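For what it's worth, if the goal is mainly to keep the distances tied to the other data, a possible alternative is to wrap the matrix in a DataFrame labelled by sample name. This is only a sketch: it swaps in pdist and squareform from scipy.spatial.distance instead of distance_matrix, and the names xyz and dist are made up for illustration:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

samples = pd.DataFrame(
    columns=['source', 'name', 'x', 'y', 'z'],
    data=[['a', 'apple', 1.0, 2.0, 3.0],
          ['b', 'pear', 2.0, 3.0, 4.0],
          ['c', 'tomato', 9.0, 8.0, 7.0],
          ['d', 'sandwich', 6.0, 5.0, 4.0]]
)

# Plain 2-D float array of coordinates, one row per sample.
xyz = samples[['x', 'y', 'z']].to_numpy()

# pdist computes each pairwise Euclidean distance once (condensed form);
# squareform expands that into the full symmetric matrix.
dist = pd.DataFrame(
    squareform(pdist(xyz)),
    index=samples['name'],
    columns=samples['name'],
)
```

With this layout a lookup such as dist.loc['apple', 'pear'] returns the distance between those two samples by name.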
Upvotes: 0