bgbrink

Reputation: 663

Convert a list of interactions into a matrix in Python

What's the most efficient way to convert a list of interactions such as this:

QWERT     ASDF      12
QWERT     ZXCV      15
QWERT     HJKL      6
:             :     :
ASDF-XYZ  HJKL-XYY  123

into an all vs all matrix representation such as this:

            QWERT   ASDF   ZXCV   ...   ASDF-XYZ
QWERT       0       12     15     ...   9
ASDF        12      0      45     ...   35
ZXCV        15      45     0      ...   24
:           :       :      :      :     :
ASDF-XYZ    9       35     24     ...   0

There could be anywhere from a few thousand up to several hundred thousand features, so speed does matter.

Edit: The input is a CSV file. Please note that the feature names are arbitrary (but unique) strings and that missing interactions should be represented as 0 in the output matrix. Made the example clearer.

Upvotes: 3

Views: 369

Answers (2)

amdex

Reputation: 781

Since you're reading a CSV, you could use pandas and pivot. This will not give you an n * n array, but an n1 * n2 array, where n1 and n2 are the numbers of unique values in the first and second column, respectively.

import pandas as pd

# Example data for exposition; replace with your own.
df = pd.DataFrame([["XYZ", "ABC", 10], 
                   ["ASDF", "XYZ", 100],
                   ["BSDF", "ABC", 1000]], columns=("id1", "id2", "value"))

pv = pd.pivot_table(df, 
                    values="value", 
                    index="id1",
                    columns="id2",
                    fill_value=0)
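
If a square, symmetric n * n matrix is needed, one possible sketch is to reindex the pivot on the union of both label columns and add its transpose. This assumes each pair appears only once and in one direction in the input (otherwise the transpose step would double-count), and the example rows below are made up for illustration:

```python
import pandas as pd

# Illustrative data only; in practice this would come from the CSV.
df = pd.DataFrame([["QWERT", "ASDF", 12],
                   ["QWERT", "ZXCV", 15],
                   ["QWERT", "HJKL", 6]],
                  columns=["id1", "id2", "value"])

# Rectangular n1 * n2 pivot, as above (summing any repeated pairs).
pv = pd.pivot_table(df, values="value", index="id1", columns="id2",
                    aggfunc="sum", fill_value=0)

# Use every label seen in either column for both axes.
labels = sorted(set(df["id1"]) | set(df["id2"]))
square = pv.reindex(index=labels, columns=labels, fill_value=0)

# Mirror the entries so that (a, b) and (b, a) hold the same value.
symmetric = square + square.T
print(symmetric)
```

Keep in mind that the result is dense, so memory grows with n * n regardless of how many interactions are actually present.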

Upvotes: 1

Reznik

Reputation: 2806

You can use numpy for this. Say the input is:

points = [(1,2,12), (1,3,15), (1,4,6)]

Here the first point is at the coordinates (1, 2) and its value is 12.

You can use the numpy function add.at:

import numpy as np

table = np.zeros((5, 5))
points = [(1,2,12), (1,3,15), (1,4,6)]
for point in points:
    # point[0:2] is the (row, column) index, point[2] the value to add
    np.add.at(table, tuple(zip(point[0:2])), point[2])
np.rot90(table)

which leaves you with the output:

array([[ 0.,  6.,  0.,  0.,  0.],
       [ 0., 15.,  0.,  0.,  0.],
       [ 0., 12.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

You can pretty easily modify the code so that it prints the headers too.
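
Since the feature names in the question are strings rather than integer coordinates, they first have to be mapped to indices. A minimal sketch, assuming the rows have already been read from the CSV into a list of tuples (the `rows` data here is made up), could use `np.unique` with `return_inverse=True` to get both the header labels and the integer indices:

```python
import numpy as np

# Illustrative rows: (feature_a, feature_b, value)
rows = [("QWERT", "ASDF", 12), ("QWERT", "ZXCV", 15), ("QWERT", "HJKL", 6)]
a = np.array([r[0] for r in rows])
b = np.array([r[1] for r in rows])
values = np.array([r[2] for r in rows], dtype=float)

# labels holds the sorted unique names (the headers);
# inverse holds the index of each input name within labels.
labels, inverse = np.unique(np.concatenate([a, b]), return_inverse=True)
i, j = inverse[:len(a)], inverse[len(a):]

# Fill both (i, j) and (j, i) so the matrix is symmetric.
table = np.zeros((len(labels), len(labels)))
np.add.at(table, (i, j), values)
np.add.at(table, (j, i), values)

print(labels)
print(table)
```

This passes whole index arrays to add.at in one call instead of looping over points. Pairs that never occur stay at 0, and a pair that occurs more than once is summed.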

Upvotes: 2
