Reputation:
Using MySQL, I'm trying to count the number of occurrences of a substring in a column, for each row.
In the example table below, the string "art" appears three times in the Terms column, so it would have a count of 3.
Example Table:
TERMS
art
artistic
painting
elephant
art deco
paint
paintings
Desired Output:
TERMS COUNT
art 3
artistic 1
painting 2
elephant 1
art deco 1
paint 2
paintings 1
Edits:
As a starting point, I know that:
SELECT terms, COUNT(*)
FROM table
GROUP BY terms
will output number of occurrences for each full term string. For substring matching, I think this may involve a subquery.
Tried the following, but every count was 1.
SELECT terms, ROUND((CHAR_LENGTH(terms) - CHAR_LENGTH(REPLACE(terms, terms, ""))) / CHAR_LENGTH(terms)) AS count
FROM table
GROUP BY terms
Upvotes: 0
Views: 6144
Reputation: 108400
I would write this by first writing a query that just returns a list of terms we want to return. For example:
SELECT t.terms
FROM `table` t
GROUP BY t.terms
Then wrap that in parens and use it as an inline view...
SELECT w.terms
FROM ( SELECT t.terms
FROM `table` t
GROUP BY t.terms
) w
ORDER BY w.terms
With that, we can do a join operation to look for matching rows, and get a count. Assuming a guarantee that terms
doesn't contain underscore (_
) or percent (%
) characters, we can use a LIKE
comparison.
Given that every term in our list is going to appear at least one time, we can use an inner join. In the more general case, where we might expect to return a zero count, we would use an outer join.
SELECT w.terms
, COUNT(1) AS `COUNT`
FROM ( SELECT t.terms
FROM `table` t
GROUP BY t.terms
) w
JOIN `table` c
ON c.terms LIKE CONCAT('%', w.terms ,'%')
GROUP BY w.terms
ORDER BY w.terms
In the LIKE
comparison, the percent signs are wildcards that match any characters (zero, one or more).
If there's a possibility that terms
does contain underscore or percent characters, we can escape those so they aren't considered wildcards by the LIKE comparison. An expression like this should do the trick:
REPLACE(REPLACE( w.terms ,'_','\_'),'%','\%')
So we'd have a query like this:
SELECT w.terms
, COUNT(1) AS `COUNT`
FROM ( SELECT t.terms
FROM `table` t
GROUP BY t.terms
) w
JOIN `table` c
ON c.terms LIKE CONCAT('%',REPLACE(REPLACE( w.terms ,'_','\_'),'%','\%'),'%')
GROUP BY w.terms
ORDER BY w.terms
There are other query patterns that will return the specified result. This is just a demonstration of one approach.
NOTE: In the example in the question, every terms
that is a substring of another terms
, the substring match appears at the beginning of the terms. This query will also find matches where the term isn't at the beginning.
e.g. dartboard
would be considered a match to art
The query could be modified to match terms
that appear only at the beginning of other terms
.
FOLLOWUP
With the example data, returns:
terms COUNT matched_terms
--------- -------- -------------------------
art 3 art,art deco,artistic
art deco 1 art deco
artistic 1 artistic
elephant 1 elephant
paint 3 paint,painting,paintings
painting 2 painting,paintings
paintings 1 paintings
In addition to the COUNT(1)
aggregate, I also included another expression in the select list. This isn't required, but it does give some additional information about which terms were considered to be matches.
GROUP_CONCAT(DISTINCT c.terms ORDER BY c.terms) AS `matched_terms`
NOTE: If there's a possibility that terms
contains backslash characters, we can escape those characters as well, using another REPLACE
REPLACE(REPLACE(REPLACE( w.terms ,'\\','\\\\'),'_','\_'),'%','\%')
^^^^^^^^ ^^^^^^^^^^^^^
Upvotes: 3