Reputation: 18614
I have a large list of clothes from an online shop.
Every entry consists of a main word (e.g. shirt, jacket) and one or more attributes (e.g. size, color).
Clothes can have the following attributes:
category: SPO (sport), BUS (business), CAS (casual)
size: XS, M, L
color: GRE (green), RED, BLACK
gender: FEM (female), MA (male)
Let's have a look at an example.
SPO SHIRT L
CAS SHIRT M GRE
CAS SHIRT L RED
BUS SHIRT XS RED
CAS SHOES
SHOES BLACK
JACKET FEM M
JACKET FEM GRE
CAS JACKET MA RED
The items are entered by humans, so the data is inconsistent: not every entry has every attribute, and the attributes may appear in different orders.
I now want to automatically recognize the main words and the attributes, and group the attributes into the categories shown above.
An example output is just the groups of words: the main words (SHIRT, SHOES, JACKET) and the categories of the attributes ({SPO, CAS, BUS}, {XS, M, L}, {GRE, RED, BLACK}, {FEM, MA}).
Of course, the algorithm won't be able to provide suitable names for the categories (e.g. size, color).
What algorithms can I use to recognize those words? One possible solution I already know of is to filter out words with <= 3 letters; most likely those are the attributes. But this is a rather naive approach.
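In JavaScript, that naive filter would look something like this (a rough sketch, assuming the whole list is in a string text, one entry per line):
// Naive heuristic: treat every word with <= 3 letters as an attribute.
const lines = text.trim().split('\n');
const words = [...new Set(lines.flatMap((l) => l.trim().split(/\s+/)))];
const attributes = words.filter((w) => w.length <= 3);
const mainWords = words.filter((w) => w.length > 3);
// Misclassifies e.g. BLACK (a long attribute) or a short main word.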
I use JavaScript, but I'm also grateful for pseudocode.
This is only a small extract of the list and the possible categories.
Update:
I do not know the attributes beforehand. The computer should extract them without my help.
Upvotes: 2
Views: 174
Reputation: 51246
Let's assume that every word is either a main word or an attribute word -- that is, there are no words that are sometimes used as main words, and sometimes as attributes. Then the problem is to classify every word as either main or attribute -- or equivalently, to determine the subset of all words that are main words (since the rest must be the attribute words). We don't know too much about attribute words, except that they should be pretty common. But we know something more definite about the subset of words that are main words: every product must contain exactly one word in this subset.
This problem can be modeled as finding an exact cover, where the ground set (the set of items that need to be "covered") is the set of all products, and every word gives us a set that we can potentially use to cover some of these ground-set elements: namely, the set of products that use that word. A solution to this problem will be a subset of words whose corresponding product sets together include every product exactly once -- that is, it will be a candidate set of main words.
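To make the framing concrete in JavaScript, here is a small sketch (identifiers are my own) that represents each product as its list of words and checks whether a candidate set of main words covers every product exactly once:
// Each product is the list of words on one input line.
const products = [
    ['SPO', 'SHIRT', 'L'],
    ['CAS', 'SHIRT', 'M', 'GRE'],
    ['CAS', 'SHIRT', 'L', 'RED'],
    ['BUS', 'SHIRT', 'XS', 'RED'],
    ['CAS', 'SHOES'],
    ['SHOES', 'BLACK'],
    ['JACKET', 'FEM', 'M'],
    ['JACKET', 'FEM', 'GRE'],
    ['CAS', 'JACKET', 'MA', 'RED']
];
// True iff every product contains exactly one of the candidate main words.
function isExactCover(products, candidateMainWords) {
    const main = new Set(candidateMainWords);
    return products.every((p) => p.filter((w) => main.has(w)).length === 1);
}
console.log(isExactCover(products, ['SHIRT', 'SHOES', 'JACKET'])); // true
console.log(isExactCover(products, ['SHIRT', 'SHOES'])); // false: JACKET rows uncovered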
Unfortunately, finding an exact cover is an NP-hard problem, so no efficient algorithms are known that will solve this problem exactly.
Nevertheless it's often the case that either an exact answer, or a good-quality heuristic answer, can be found by expressing the problem as an integer linear program, and using an ILP solver, such as the freely available lp_solve or the expensive (but much faster) CPLEX, to solve the problem. There are two approaches:
Solve the ILP exactly, with all variables constrained to be 0 or 1. This can take exponential time in the worst case, but yields a provably optimal solution.
Solve the continuous relaxation, in which variables may take fractional values. This is fast even for large inputs, but may produce a fractional solution that has to be rounded into a (possibly suboptimal) heuristic answer.
Conveniently, changing from one to the other is as simple as specifying whether the variables are constrained to be integers or not; the objective function and all the constraints don't change.
Suppose there are n distinct words, and m products. To formulate this problem as an (I)LP, we will need a variable x_j for each distinct word j, which the solver will assign the value 1 if the j-th word should be considered a main word, or 0 if it should be considered an attribute word (or something in between if we are only solving the continuous relaxation). Let p_i_j be 1 if product i uses word j (this information is present in the input, and can be represented as a matrix). Then for each product i, we need the constraint
p_i_1*x_1 + p_i_2*x_2 + ... + p_i_n*x_n = 1.
This forces product i to have exactly one main word. (Most words won't be present in a given product, so the corresponding term in the above expression will have a 0 value for p_i_j and can be omitted; doing this will reduce the size of the problem description considerably.)
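For the JavaScript side, the p_i_j matrix falls out directly from the tokenized input; a short sketch, reusing the products array from the snippet above:
// Distinct words in sorted order; word j (0-based) corresponds to x_(j+1).
const words = [...new Set(products.flat())].sort();
// p[i][j] = 1 if product i uses word j, else 0 (sparse in practice).
const p = products.map((prod) => words.map((w) => (prod.includes(w) ? 1 : 0)));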
Those constraints are all we actually need, but ILPs also give us an objective function to minimise or maximise. So we could for example find the largest (or smallest) set of main words by trying to maximise (or minimise) the function
x_1 + x_2 + ... + x_n
This formulation is pretty flexible: We can easily weight some words more highly, or prevent some words (e.g. words that appear too frequently) from being considered as main words by constraining their x_j values to 0.
Let's actually calculate a word set for the example data fragment provided.
The distinct words are:
1 BLACK
2 BUS
3 CAS
4 FEM
5 GRE
6 JACKET
7 L
8 M
9 MA
10 RED
11 SHIRT
12 SHOES
13 SPO
14 XS
[EDIT 7/3/2016]: We can express the ILP problem in CPLEX LP format, adding an objective function that (somewhat arbitrarily) minimises the total number of words chosen as main words:
Minimize
x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10 + x_11 + x_12 + x_13 + x_14
Subject To
x_13 + x_11 + x_7 = 1
x_3 + x_11 + x_8 + x_5 = 1
x_3 + x_11 + x_7 + x_10 = 1
x_2 + x_11 + x_14 + x_10 = 1
x_3 + x_12 = 1
x_12 + x_1 = 1
x_6 + x_4 + x_8 = 1
x_6 + x_4 + x_5 = 1
x_3 + x_6 + x_9 + x_10 = 1
Binary
x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 x_11 x_12 x_13 x_14
Submitting a text file containing the above code block to the online SCIP solver at NEOS, we get back (reportedly in 0.00s time!) an optimal solution which assigns the 3 variables x_6, x_11 and x_12 to 1, and the rest to 0: this corresponds to choosing JACKET, SHIRT and SHOES as main words.
In this case, choosing to minimise the number of words chosen as main words got us the answer we know to be correct, but in general there can be multiple sets of main words that satisfy the criteria, and it's not obvious how to choose which is best. For example, if we had decided to maximise the number of main words instead (as I tried initially, FWIW), we could have gotten the solution {BLACK, BUS, CAS, FEM, SPO} instead. In general though, the more input that is provided, the fewer satisfying sets of words there will be, and so the less the choice of objective function will matter.
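Since the question mentions JavaScript, here is a minimal sketch (the function name is mine) that generates the above LP file from the raw input, so that larger lists don't have to be encoded by hand; it numbers the variables in sorted word order, matching the list above:
// Generate the CPLEX LP formulation (minimising the number of main words)
// from the raw input text, one product per line.
function buildLP(text) {
    const products = text.trim().split('\n')
        .map((line) => line.trim().split(/\s+/));
    // Distinct words in sorted order; word j (0-based) becomes variable x_(j+1).
    const words = [...new Set(products.flat())].sort();
    const index = new Map(words.map((w, j) => [w, j + 1]));
    const allVars = words.map((w, j) => 'x_' + (j + 1));
    // One "exactly one main word" constraint per product.
    const constraints = products.map(
        (p) => p.map((w) => 'x_' + index.get(w)).join(' + ') + ' = 1'
    );
    return [
        'Minimize',
        allVars.join(' + '),
        'Subject To',
        ...constraints,
        'Binary',
        allVars.join(' '),
        'End' // LP-format terminator
    ].join('\n');
}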
After determining which words are main words, we want to break the remaining attribute words up into categories. This problem is even harder: Essentially all we can say is that whenever two words are from the same category, like CAS and BUS, they must not both appear in the same product. But if that is the only constraint we apply, then an easy solution is just to put every single attribute word into its own category. Of course that doesn't tell us anything useful, so to avoid this possibility, we can ask that the total number of categories be as small as possible.
What we then have is the (Minimum) Graph Colouring problem: For each attribute word, create a vertex, and create an edge between any two vertices whenever their words both appear in the same product. A graph colouring is an assignment of "colours" (or equivalently, simply categories) to the vertices (words) that always assigns different colours to the two endpoints of an edge. A minimum such colouring uses the fewest possible colours (categories).
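An exact minimum colouring needs ILP machinery again (see below), but a simple greedy colouring, a heuristic that is not guaranteed to be minimum, is easy to write and already gives usable groups; a JavaScript sketch, reusing the products array from earlier:
// Group the attribute words: two words conflict (need different categories)
// if they appear together in some product.
function categorize(products, mainWords) {
    const main = new Set(mainWords);
    const conflicts = new Map(); // word -> Set of conflicting words
    for (const p of products) {
        const attrs = p.filter((w) => !main.has(w));
        for (const a of attrs) {
            if (!conflicts.has(a)) conflicts.set(a, new Set());
            for (const b of attrs) if (b !== a) conflicts.get(a).add(b);
        }
    }
    // Greedily assign each word the smallest category number
    // not already used by one of its conflicting neighbours.
    const category = new Map();
    for (const word of conflicts.keys()) {
        const used = new Set([...conflicts.get(word)].map((n) => category.get(n)));
        let c = 0;
        while (used.has(c)) c++;
        category.set(word, c);
    }
    return category; // word -> category number
}
console.log(categorize(products, ['SHIRT', 'SHOES', 'JACKET']));
On this small fragment the greedy pass produces three groups, {SPO, CAS, BUS, BLACK, FEM}, {L, M, XS, MA} and {GRE, RED}; BLACK co-occurs with nothing and FEM never conflicts with the first group, so both get merged into it. With more input lines the conflict graph becomes denser and the groups should come closer to the intended categories.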
There are several different ways to formulate a Graph Colouring problem as an ILP. You could use the one given here. Note that there will be many more variables than in the Exact Cover ILP, and with some formulations there will be as many as k! different ways to encode a particular breakdown into k categories (essentially, as many category assignments as there are ways to order k items), so solving this problem to optimality will likely take much longer.
Upvotes: 1
Reputation: 2516
UPDATE:
Remember: in such cases, the more data you have, the more accurate an algorithm becomes. But if you only have a few rows with many attributes, then almost every algorithm will fail. Also, the <= 3 letters heuristic will fail for BLACK, and for a product name like TIE, so we will have to look for a better algorithm.
Now here's my take on the problem:
Parse the data and create a matrix that records, for each unique word, which other words it appears with in an entry and in which position. This gives statistics about which words are used together, and how frequently each word occurs across entries (see the sketch after these steps).
I would suggest creating (or importing from somewhere) a list of human-identifiable color names, because color names are universal: if it's blue, one will write it as BLUE. This will help identify the colors.
Once colors are identified, we can look for gender. If an item is for females, it will most likely be written as F, FEM or FEMALE, so we will have to find the closest match for the gender name with a string-matching algorithm.
Now target the <= 2-character words, which will give you the sizes.
Afterwards, look for the most frequent words among the remaining words; each category word will occur fairly often (because there is a limited number of categories), so set a minimum-occurrence threshold to identify a word as a category.
At last, the remaining words will be the item names (shirt, shoes, etc.).
The only errors that can occur here are when there is very little data or the frequency threshold is not tuned correctly.
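A minimal JavaScript sketch of the matrix-building step (identifiers are mine), recording per-word frequencies and co-occurrence counts for the frequency-threshold step; position tracking is omitted for brevity:
// Build word frequencies and co-occurrence counts from the raw list.
function buildStats(text) {
    const freq = new Map(); // word -> total occurrences
    const cooc = new Map(); // word -> Map(other word -> co-occurrence count)
    for (const line of text.trim().split('\n')) {
        const words = line.trim().split(/\s+/);
        for (const w of words) {
            freq.set(w, (freq.get(w) || 0) + 1);
            if (!cooc.has(w)) cooc.set(w, new Map());
            const row = cooc.get(w);
            for (const other of words) {
                if (other !== w) row.set(other, (row.get(other) || 0) + 1);
            }
        }
    }
    return { freq, cooc };
}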
MY PREVIOUS ANSWER (from when I thought you just had to parse the input, but it can still be used to create the matrix):
I think you can work it out by creating maps and iterating over them, like
(function () {
    "use strict";
    const text = `SPO SHIRT L
CAS SHIRT M GRE
CAS SHIRT L RED
BUS SHIRT XS RED
CAS SHOES
SHOES BLACK
JACKET FEM M
JACKET FEM GRE
CAS JACKET MA RED`;
    // Known token lists, grouped by attribute category.
    const tokens = {
        size_list: "XS, S, M, L, XL".split(', '),
        color_list: "GREEN, RED, BLUE, BLACK".split(', '),
        category_list: "SPORT, BUSINESS, CASUAL".split(', '),
        gender_list: ["MA", "FEM", "UNI"],
        item_list: "SHOES, SHIRT, JACKET".split(', ')
    };
    // A word matches a token if it is written in full (BLACK, RED)
    // or abbreviated to the token's first three letters (GRE, SPO, CAS).
    const matches = (token, w) => token === w || token.slice(0, 3) === w;
    const items = text.split(/\n/)
        .filter((line) => !!line.trim())
        .map((line) => {
            const item = {};
            line.trim().split(' ').forEach((w) => {
                Object.keys(tokens).forEach((t) => {
                    if (tokens[t].some((token) => matches(token, w))) {
                        item[t.replace('_list', '')] = w;
                    }
                });
            });
            return item;
        });
    document.write(JSON.stringify(items, null, 4));
}());
Upvotes: -1