Reputation: 81
I'm working with this dataset in pandas, and I'm stuck at this step.
I have a dataframe that looks like this:
     id1  id2  id3  id4
id1  1    0.3  0.5  0.2
id2  0.2  1    0.4  0.7
id3  0    0.5  1    0.8
id4  0.6  0.1  0    1
Now, id<num> refers to the post_id associated with each message. The matrix above is a 2D array of cosine similarity scores that I computed between those messages.
The end goal is to display the posts on a UI and group together those that appear similar to each other. For this, I need the score for each pair: id1 and id2, id1 and id3, and so on.
However, I now have a diagonal of 1's (which makes sense, since each message is compared with itself). How can I do this in a better way, without double for loops, and end up with a dataframe like the one shown below?
What I'm currently doing gives me this:
id1, id1, score
id1, id2, score
id1, id3, score
id1, id4, score
id2, id1, score
id2, id2, score
id2, id3, score
id2, id4, score
id3, id1, score
id3, id2, score
id3, id3, score
id3, id4, score
id4, id1, score
id4, id2, score
id4, id3, score
id4, id4, score
What I want to get is this:
id1, id2, score
id1, id3, score
id1, id4, score
id2, id3, score
id2, id4, score
id3, id4, score
How can I achieve this? Should I reshape the 2D matrix to arrays? I feel like I'm missing something here.
Appreciate your help in this issue.
Upvotes: 2
Views: 1051
Reputation: 88226
You could insert NaNs in the lower triangular part of the dataframe (diagonal included), stack to remove those NaNs, and build a dict from the iterator returned by pd.Series.items (pd.Series.iteritems in older pandas; it was removed in pandas 2.0):
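import numpy as np

a = df.to_numpy(dtype=float, copy=True)  # writable float copy, so NaN can be assigned
a[np.tril_indices(a.shape[0], 0)] = np.nan  # blank the diagonal and everything below it
df[:] = a # or directly create a new one -> pd.Dat...(a, index=df.index...)
out = dict(df.stack().dropna().items())  # dropna() is a no-op if stack() already dropped the NaNs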
print(out)
{('id1', 'id2'): 0.3,
('id1', 'id3'): 0.5,
('id1', 'id4'): 0.2,
('id2', 'id3'): 0.4,
('id2', 'id4'): 0.7,
('id3', 'id4'): 0.8}
Note that with
a = df.to_numpy(dtype=float, copy=True)
a[np.tril_indices(a.shape[0], 0)] = np.nan
you mask the diagonal and the lower triangle, i.e. the self-similarities and the pairs that already appear in the other order:
print(a)
[[nan 0.3 0.5 0.2]
[nan nan 0.4 0.7]
[nan nan nan 0.8]
[nan nan nan nan]]
So that by then updating the dataframe, and stacking, you'll end up with the combinations of interest:
df[:] = a
df.stack()
id1  id2    0.3
     id3    0.5
     id4    0.2
id2  id3    0.4
     id4    0.7
id3  id4    0.8
dtype: float64
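A minimal sketch of the same idea that leaves the original dataframe untouched, assuming the sample matrix from the question. The names `upper` and `pairs` are illustrative only. It builds a boolean mask with `np.triu` and applies it with `DataFrame.where` instead of overwriting `df` in place:

```python
import numpy as np
import pandas as pd

# Sample matrix from the question (rows and columns share the same labels).
ids = ["id1", "id2", "id3", "id4"]
df = pd.DataFrame(
    [[1.0, 0.3, 0.5, 0.2],
     [0.2, 1.0, 0.4, 0.7],
     [0.0, 0.5, 1.0, 0.8],
     [0.6, 0.1, 0.0, 1.0]],
    index=ids, columns=ids,
)

# Boolean mask that is True strictly above the diagonal (k=1 skips the diagonal).
upper = np.triu(np.ones(df.shape, dtype=bool), k=1)

# where() keeps values where the mask is True and puts NaN everywhere else;
# dropna() after stack() makes the NaN removal explicit.
pairs = df.where(upper).stack().dropna()

out = dict(pairs.items())
print(out)
```

Because `where` returns a new dataframe, `df` still holds the full matrix afterwards, which may be convenient if it is needed elsewhere.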
Upvotes: 1
Reputation: 5502
You can try stack:
out = df.stack() \
.reset_index() \
.rename(columns={"level_0": "col1", "level_1": "col2", 0: "score"})
out = out[out.col1 != out.col2]
Output (the diagonal is removed, but each pair still appears in both orders):
print(out)
#    col1 col2  score
# 1   id1  id2    0.3
# 2   id1  id3    0.5
# 3   id1  id4    0.2
# 4   id2  id1    0.2
# 6   id2  id3    0.4
# 7   id2  id4    0.7
# 8   id3  id1    0.0
# 9   id3  id2    0.5
# 11  id3  id4    0.8
# 12  id4  id1    0.6
# 13  id4  id2    0.1
# 14  id4  id3    0.0
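Explanation:
1) stack to stack all values in one column:
print(df.stack())
# id1  id1    1.0
#      id2    0.3
#      id3    0.5
#      id4    0.2
# id2  id1    0.2
#      id2    1.0
#      id3    0.4
#      id4    0.7
# id3  id1    0.0
#      id2    0.5
#      id3    1.0
#      id4    0.8
# id4  id1    0.6
#      id2    0.1
#      id3    0.0
#      id4    1.0
2) reset_index to turn the two index levels into regular columns
3) rename to give those columns readable names
4) Keep only the rows where col1 differs from col2, which removes the diagonal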
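If each pair should appear only once, as in the question's desired output, one possible follow-up is to compare the positions of the two labels rather than the labels themselves. This is a sketch under the assumption that rows and columns are in the same order; `pos` and `keep` are illustrative names:

```python
import pandas as pd

# Sample matrix from the question (rows and columns share the same labels).
ids = ["id1", "id2", "id3", "id4"]
df = pd.DataFrame(
    [[1.0, 0.3, 0.5, 0.2],
     [0.2, 1.0, 0.4, 0.7],
     [0.0, 0.5, 1.0, 0.8],
     [0.6, 0.1, 0.0, 1.0]],
    index=ids, columns=ids,
)

# Long format: one row per (row label, column label, score).
out = df.stack().reset_index()
out.columns = ["col1", "col2", "score"]  # name the three columns directly

# Position of each label along the matrix axis.
pos = {label: i for i, label in enumerate(df.index)}

# Keep a pair only when its row comes before its column,
# i.e. the entries strictly above the diagonal.
keep = out["col1"].map(pos) < out["col2"].map(pos)
out = out[keep].reset_index(drop=True)
print(out)
```

Comparing positions avoids relying on string ordering of the ids, which would sort "id10" before "id2".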
Upvotes: 1
Reputation: 4893
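Try this:
1) multiply the dataframe by numpy.triu(numpy.ones(df.shape), k=1), adjusting the k parameter, to zero the unwanted part of the table (the diagonal and the lower triangle)
2) pandas.melt() the dataframe into the desired long format
3) filter out the rows with zeros (note that this also drops pairs whose real score is 0; masking with NaN instead of zero avoids that)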
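The steps above can be sketched roughly as follows, assuming the sample matrix from the question. This variant masks with NaN rather than multiplying by zero, so that a genuine score of 0 is not filtered out; the names `upper`, `masked`, and `long_df` are illustrative:

```python
import numpy as np
import pandas as pd

# Sample matrix from the question (rows and columns share the same labels).
ids = ["id1", "id2", "id3", "id4"]
df = pd.DataFrame(
    [[1.0, 0.3, 0.5, 0.2],
     [0.2, 1.0, 0.4, 0.7],
     [0.0, 0.5, 1.0, 0.8],
     [0.6, 0.1, 0.0, 1.0]],
    index=ids, columns=ids,
)

# Step 1: keep only the entries strictly above the diagonal, NaN elsewhere.
upper = np.triu(np.ones(df.shape, dtype=bool), k=1)
masked = df.where(upper)

# Step 2: melt into long format with one row per (col1, col2, score).
long_df = (
    masked.rename_axis("col1")
    .reset_index()
    .melt(id_vars="col1", var_name="col2", value_name="score")
)

# Step 3: drop the masked rows and tidy up the ordering.
long_df = (
    long_df.dropna(subset=["score"])
    .sort_values(["col1", "col2"])
    .reset_index(drop=True)
)
print(long_df)
```

Note that melt walks the table column by column, so the final sort is only there to present the pairs in row order.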
Upvotes: 0