SubZeno
SubZeno

Reputation: 380

How to perform matrix union and intersection?

Supposing using NumPy, given two matrixes mat_A and mat_B:

              value 
date                                                     
2020-10-16    49.78
2020-10-15    50.30
2020-10-14    50.44
2020-10-13    50.85
2020-10-12    51.42
...             ...
2014-01-08    45.25
2014-01-07    45.18
2014-01-06    45.19
2014-01-03    45.21
2014-01-02    44.95

[1711 rows x 2 columns]

              value
date                                                     
2020-10-16     95.0
2020-10-15     95.0
2020-10-14     95.0
2020-10-13     95.0 
2020-10-12     95.0
...             ...
2012-08-08    209.0
2012-08-07    209.0
2012-08-06    209.0
2012-08-03    209.0
2012-08-02    209.0

[1361 rows x 2 columns]

Is there an efficient way to get the following?

Upvotes: 0

Views: 310

Answers (1)

Askold Ilvento
Askold Ilvento

Reputation: 1520

To intersect you may use intersect1d

_, inter, _ = numpy.intersect1d(a[:, 0], b[:, 0], assume_unique=True, return_indices=True)

Which gives you indexes of a in b and vice versa. This one is pivoted. And if you need not pivoted one then you may do it thrice

_, inter 1, _ = numpy.intersect1d(a[:, 1], b[:, 1], assume_unique=True, return_indices=True)

_, no_piv_inter, _ = numpy.intersect1d(inter1, inter2, assume_unique=True, return_indices=True)

For symmetric difference, there is setxor1d

diff = numpy.setxor1d(a[:, 0], b[:, 0], assume_unique=True, return_indices=True)

There is union1d. But it seems to do not work for your case because it does not return indexes. To join you may use vstack and unique.

v = numpy.vstack((a, b))
_, unique = numpy.unique(v[:,0], axis=0, return_index=True)
union = v[unique, :]

Then you may fill the difference

diff_idx = numpy.in1d(union[:, 0], diff, assume_unique=True)
union[diff_idx, 0] = None

I did not run all codes. Do not beat me for syntax errors. The idea should be correct.

This way is not efficient in terms of complexity. I believe that the data structure is wrong. For the requested operation hash table that is dict in python are preferable. But these numpyed operations supposed to be fast relative to loops.

Upvotes: 1

Related Questions