Reputation: 663
What's the most efficient way to convert a list of interactions such as this:
QWERT ASDF 12
QWERT ZXCV 15
QWERT HJKL 6
: : :
ASDF-XYZ HJKL-XYY 123
into an all vs all matrix representation such as this:
          QWERT  ASDF  ZXCV  ...  ASDF-XYZ
QWERT         0    12    15  ...         9
ASDF         12     0    45  ...        35
ZXCV         15    45     0  ...        24
:             :     :     :    :         :
ASDF-XYZ      9    35    24  ...         0
There could be anywhere from a few thousand up to several hundred thousand features, so speed does matter.
Edit: The input is a CSV file. Please note that the feature names are arbitrary (but unique) strings and that a missing interaction should be represented as 0 in the output matrix. Made the example clearer.
Upvotes: 3
Views: 369
Reputation: 781
Since you're reading a CSV, you could use pandas and pivot. This will not give you an n * n array, but an n1 * n2 array, where n1 and n2 are the numbers of unique values in the first and second column, respectively.
import pandas as pd

# For exposition, replace with data.
df = pd.DataFrame([["XYZ", "ABC", 10],
                   ["ASDF", "XYZ", 100],
                   ["BSDF", "ABC", 1000]], columns=("id1", "id2", "value"))
pv = pd.pivot_table(df,
                    values="value",
                    index="id1",
                    columns="id2",
                    fill_value=0)
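If a square, symmetric n * n matrix is needed, one possible extension is to reindex the pivot over the union of both ID columns and then add its transpose. This is only a sketch: it assumes each pair appears once in the input (in one order only) and that there are no self-interactions, since otherwise the mirrored sum would double those values. The names `labels`, `square` and `sym` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame([["XYZ", "ABC", 10],
                   ["ASDF", "XYZ", 100],
                   ["BSDF", "ABC", 1000]], columns=("id1", "id2", "value"))

pv = pd.pivot_table(df, values="value", index="id1", columns="id2", fill_value=0)

# All feature names seen in either column, in sorted order.
labels = sorted(set(df["id1"]) | set(df["id2"]))

# Expand to a square frame; pairs absent from the input become 0.
square = pv.reindex(index=labels, columns=labels, fill_value=0)

# Mirror across the diagonal so that sym.loc[a, b] == sym.loc[b, a].
# Assumption: each pair is listed once, so nothing gets counted twice.
sym = square + square.T
```

Here `sym.loc["XYZ", "ABC"]` and `sym.loc["ABC", "XYZ"]` both hold the value from the single input row for that pair.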
Upvotes: 1
Reputation: 2806
You can use numpy for this. Let's say the input is:
points = [(1,2,12), (1,3,15), (1,4,6)]
The first point is at coordinates (1, 2) and its value is 12.
You can use the numpy function add.at:
import numpy as np

table = np.zeros((5, 5))
points = [(1, 2, 12), (1, 3, 15), (1, 4, 6)]
for point in points:
    # Add the value (point[2]) at the (row, column) given by point[:2].
    np.add.at(table, tuple(zip(point[:2])), point[2])
np.rot90(table)
which leaves you with the output:
array([[ 0., 6., 0., 0., 0.],
[ 0., 15., 0., 0., 0.],
[ 0., 12., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
You can pretty easily modify the code so it prints the headers too.
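As one way to do that for arbitrary string names, the names can first be mapped to integer positions. The following is a sketch, not a tuned implementation: it assumes the interactions are already loaded as (name, name, value) tuples, that the matrix should be symmetric, and that there are no self-interactions. The variable names `interactions`, `labels`, `inverse` and `table` are illustrative only.

```python
import numpy as np

# Example interactions taken from the question.
interactions = [("QWERT", "ASDF", 12), ("QWERT", "ZXCV", 15), ("QWERT", "HJKL", 6)]

a, b, values = zip(*interactions)

# labels: sorted unique names; inverse: position of each name within labels.
labels, inverse = np.unique(np.array(a + b), return_inverse=True)
rows, cols = inverse[:len(a)], inverse[len(a):]

table = np.zeros((len(labels), len(labels)))
# add.at accumulates correctly even if the same (row, col) pair repeats.
np.add.at(table, (rows, cols), values)
# Mirror to make the matrix symmetric (assumes no self-interactions).
np.add.at(table, (cols, rows), values)

# Print a header row followed by one labelled row per feature.
print("\t".join(["", *labels]))
for name, row in zip(labels, table):
    print("\t".join([name, *(f"{v:g}" for v in row)]))
```

Note that a dense float64 array for 300,000 features would need roughly 720 GB (300,000 * 300,000 * 8 bytes), so at the upper end of the sizes mentioned in the question a sparse representation is probably required instead.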
Upvotes: 2