Vitaliy
Vitaliy

Reputation: 137

Algorithm for computing each feature costs

I have to build some complicated system, having the following problem as it`s part (this is more or less formalized description): Let's assume, we have a set of some objects O = {objects}, and a list of features, that each object can contain F = {features}. So, we can think of each object as a list of features that it has: object o = {f1, f3, f15}. Each object has also the integer value. Now, we have to find the average and the median value of each feature. It is acceptable to solve the problem approximately (I have a feeling the the possible algorithm can have exponential complexity).

The setcan be large (like 10^5 elements). We can also imagine all objects in the database table like this:

objectid|value|f1|f2|f3|f4|...|f30
100     |3456 |0 |1 |0 |1 |...|0
101     |61234|0 |0 |1 |1 |...|1 
102     |8761 |1 |0 |0 |1 |...|0 
.........................
9999    |8080 |1 |1 |0 |0 |...|1

if we had a small amount of elements, it would be possible to build a system of linear equations and solve them. But this obviously won't work for thouthands elements.

Any ideas how to proceed?

Addition: Sample. Let's proceed with a simple, artificial example. Let's say we have some object type on the market with features 0 to 3 (i.e. tool boxes with hammer, screwdriver, set of drills and chisel). we have the following objects on the market, saved to the db table:

object| f0 | f1 | f2 | f3 | price
obj0  | 1  | 1  | 0  | 0  | 700
obj1  | 1  | 1  | 0  | 0  | 750
obj2  | 1  | 1  | 1  | 0  | 950
obj3  | 1  | 1  | 1  | 0  | 1200
obj4  | 0  | 1  | 1  | 1  | 980
obj5  | 0  | 1  | 1  | 1  | 1020
obj6  | 0  | 1  | 1  | 0  | 790
obj7  | 0  | 1  | 1  | 0  | 820
obj8  | 1  | 0  | 1  | 0  | 690
obj9  | 1  | 0  | 1  | 0  | 780

then we can easily compute the average price of each feature: First, we group elements by feature list, then for each group we compute the average price. Then for each feature, we find all groups, that distinguishes by that feature only. we find the price difference between the groups, and then we find the average of all these differences.

For example, for f0: 1. "group elements by feature list" {f0, f1} -> {obj0, obj1} {f0, f1, f2} -> {obj2, obj3} {f1, f2, f3} -> {obj4, obj5} {f1, f2} -> {obj6, obj7} {f0, f2} -> {obj8, obj9}

  1. "then for each group we compute the average price" {f0, f1} -> 725 {f0, f1, f2} -> 1075 {f1, f2, f3} -> 1000 {f1, f2} -> 805 {f0, f2} -> 735

  2. "we find all groups, that distinguishes by that feature only. " we can obtain feature f0 only once: {f0, f1, f2} minus {f1, f2}.

  3. "we find the price difference between the groups" {f0, f1, f2} costs 1075, {f1, f2} costs 805, so the feature f0 costs 1075-805 = 270.

  4. "we find the average of all these differences" the price will be 270.

In the same way we can compute other prices: f1 costs 340, f2 costs 350, f3 costs 195 in average.

Now, let’s say, I would like to bring to the market a new tool box, with features f0, f1, f3. I can say, that the average cost should be 805. I understand, that such an approach is very trivial. I would appreciate any advices on mathematical/algorithmic approaches for such type of tasks.

Upvotes: 1

Views: 72

Answers (1)

Sneftel
Sneftel

Reputation: 41503

Since the problem is underconstrained (in that features can apparently have different values in each object), no definite average or median value can be calculated for individual features in the general case.

One approach would be to find the Moore-Penrose pseudoinverse of the feature matrix, then multiply it by the object-value vector; that will result in a least-squares solution for the feature values, such that the total squared error for object values is minimized. This isn't lightning-fast, but for only 10^5 objects it should be fine, assuming you use a well-optimized implementation.

Upvotes: 1

Related Questions