Reputation: 2296
I have two dataframes; let's call the first one df and the second one compare_df. The first one looks like this:
Date cell tumor_size (assume it is three dimensional)
25/10/2015 113 [51, 52, 55]
22/10/2015 222 [50, 68, 22]
22/10/2015 883 [45, 23, 67]
20/10/2015 334 [35, 23, 76]
and the second one looks like this:
Date cell tumor_size
19/10/2015 564 [47, 23, 56]
19/10/2015 123 [56, 11, 23]
22/10/2014 345 [36, 66, 78]
13/12/2013 456 [44, 21, 83]
For each row in the first dataframe, I want to go through each row in the second dataframe, record the Euclidean distances, and then take the minimum one. This is the code I wrote to try to accomplish this:
# These will be our lists of pairs and size differences.
pairs = []
diffs = []
for row in df.itertuples():
    compare_df['distance'] = np.linalg.norm(compare_df.tumor_size - row.tumor_size)  # I get error for this line
    row_of_interest = compare_df.loc[compare_df.distance == compare_df.distance.min()]
    pairs.append(row_of_interest.cell.values[0])
    diffs.append(row_of_interest.distance.values[0])
df['most_similar_to'] = pairs
df['similarity'] = diffs
However I get:
ValueError: Length of values does not match length of index
This happens even though the vectors are all the same size and I have dropped the NaN values. Any ideas?
Upvotes: 2
Views: 375
Reputation: 2361
Your mistake is in trying to subtract a list of size three (row.tumor_size) from a pd.Series of a different length (compare_df.tumor_size). When subtracting a list or tuple from a pd.Series, pandas tries to match the two sequences and subtract each pair of matching rows. However, when the list and the pd.Series are of different sizes, it doesn't know how to match them and raises the exception.
Judging from the error message, your pandas version is probably a bit old. You can try to use apply to force the subtraction operator to be used row by row:
compare_df.tumor_size.apply(
    lambda compare_size: np.array(compare_size) - np.array(row.tumor_size)
)
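Note that the apply above only gives you the per-row difference vectors, not the distances. Here is a minimal sketch of one way to finish that step, wrapping the difference in np.linalg.norm inside the same lambda. The sample compare_df is copied from the question (without the Date column), and row_size is a stand-in I made up for row.tumor_size:

```python
import numpy as np
import pandas as pd

# Sample data from the question (Date column left out).
compare_df = pd.DataFrame({
    "cell": [564, 123, 345, 456],
    "tumor_size": [[47, 23, 56], [56, 11, 23], [36, 66, 78], [44, 21, 83]],
})

# Hypothetical stand-in for row.tumor_size inside the loop.
row_size = [51, 52, 55]

# One Euclidean distance per row of compare_df.
compare_df["distance"] = compare_df["tumor_size"].apply(
    lambda compare_size: np.linalg.norm(np.array(compare_size) - np.array(row_size))
)

# Row of compare_df with the smallest distance.
closest = compare_df.loc[compare_df["distance"].idxmin()]
```

With this, closest["cell"] and closest["distance"] are what the question's loop appends to pairs and diffs.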
Of course, it may be beneficial to convert all the lists to np.array ahead of time.
If you don't like np.array, you can use:
compare_df.tumor_size.apply(
    lambda compare_size: [compare_size[i] - row.tumor_size[i] for i in range(3)]
)
In pandas 0.21.0 (perhaps a bit earlier), you would have got a different error message:
TypeError: unsupported operand type(s) for -: 'list' and 'list'
In this case, there is an easier solution - just convert the list to an np.array, and it will work like magic:
compare_df.tumor_size - np.array(row.tumor_size)
For me, this works with pandas==0.21.0 and numpy==1.13.3.
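If the behaviour of Series arithmetic on list columns differs in your pandas version, another option is to skip it entirely. The following is a rough sketch (my own variation, not something tested on the versions above) that stacks both tumor_size columns into 2-D NumPy arrays and uses broadcasting to get every pairwise distance at once. It assumes every list has the same length; the sample frames are copied from the question without the Date column:

```python
import numpy as np
import pandas as pd

# Sample data from the question (Date column left out).
df = pd.DataFrame({
    "cell": [113, 222, 883, 334],
    "tumor_size": [[51, 52, 55], [50, 68, 22], [45, 23, 67], [35, 23, 76]],
})
compare_df = pd.DataFrame({
    "cell": [564, 123, 345, 456],
    "tumor_size": [[47, 23, 56], [56, 11, 23], [36, 66, 78], [44, 21, 83]],
})

# Stack the per-row lists into float arrays of shape (n_rows, 3).
sizes = np.array(df["tumor_size"].tolist(), dtype=float)
compare_sizes = np.array(compare_df["tumor_size"].tolist(), dtype=float)

# Broadcasting (n, 1, 3) - (1, m, 3) gives (n, m, 3); the norm over the
# last axis yields an (n, m) matrix of Euclidean distances.
distances = np.linalg.norm(sizes[:, None, :] - compare_sizes[None, :, :], axis=2)

# For each row of df, the position of the closest row in compare_df.
nearest = distances.argmin(axis=1)
df["most_similar_to"] = compare_df["cell"].values[nearest]
df["similarity"] = distances.min(axis=1)
```

This builds the full n-by-m distance matrix in memory, so for very large frames the row-by-row loop may still be the more practical choice.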
Upvotes: 2