user3639124
user3639124

Reputation:

MySQL Count number of substring occurrences in column

Using MySQL, I'm trying to count the number of occurrences of a substring in a column, for each row.

In the example table below, the string "art" appears three times in the Terms column, so it would have a count of 3.

Example Table:

TERMS
art
artistic
painting
elephant
art deco
paint
paintings

Desired Output:

TERMS      COUNT
art        3
artistic   1
painting   2
elephant   1
art deco   1
paint      2
paintings  1

Edits:

As a starting point, I know that:

SELECT terms, COUNT(*)
FROM table
GROUP BY terms

will output number of occurrences for each full term string. For substring matching, I think this may involve a subquery.

Tried the following, but every count was 1.

SELECT terms, ROUND((CHAR_LENGTH(terms) - CHAR_LENGTH(REPLACE(terms, terms, ""))) / CHAR_LENGTH(terms)) AS count
FROM table
GROUP BY terms

Upvotes: 0

Views: 6144

Answers (1)

spencer7593
spencer7593

Reputation: 108400

I would write this by first writing a query that just returns a list of terms we want to return. For example:

 SELECT t.terms
   FROM `table` t
  GROUP BY t.terms

Then wrap that in parens and use it as an inline view...

SELECT w.terms
  FROM ( SELECT t.terms
           FROM `table` t
          GROUP BY t.terms
       ) w
 ORDER BY w.terms

With that, we can do a join operation to look for matching rows, and get a count. Assuming a guarantee that terms doesn't contain underscore (_) or percent (%) characters, we can use a LIKE comparison.

Given that every term in our list is going to appear at least one time, we can use an inner join. In the more general case, where we might expect to return a zero count, we would use an outer join.

SELECT w.terms
     , COUNT(1) AS `COUNT`
  FROM ( SELECT t.terms
           FROM `table` t
          GROUP BY t.terms
       ) w
  JOIN `table` c
    ON c.terms LIKE CONCAT('%', w.terms ,'%')
 GROUP BY w.terms
 ORDER BY w.terms

In the LIKE comparison, the percent signs are wildcards that match any characters (zero, one or more).

If there's a possibility that terms does contain underscore or percent characters, we can escape those so they aren't considered wildcards by the LIKE comparison. An expression like this should do the trick:

 REPLACE(REPLACE( w.terms ,'_','\_'),'%','\%')

So we'd have a query like this:

SELECT w.terms
     , COUNT(1) AS `COUNT`
  FROM ( SELECT t.terms
           FROM `table` t
          GROUP BY t.terms
       ) w
  JOIN `table` c
    ON c.terms LIKE CONCAT('%',REPLACE(REPLACE( w.terms ,'_','\_'),'%','\%'),'%')
 GROUP BY w.terms
 ORDER BY w.terms

There are other query patterns that will return the specified result. This is just a demonstration of one approach.


NOTE: In the example in the question, every terms that is a substring of another terms, the substring match appears at the beginning of the terms. This query will also find matches where the term isn't at the beginning.

e.g. dartboard would be considered a match to art

The query could be modified to match terms that appear only at the beginning of other terms.

FOLLOWUP

With the example data, returns:

terms      COUNT     matched_terms
---------  --------  -------------------------
art               3  art,art deco,artistic
art deco          1  art deco
artistic          1  artistic
elephant          1  elephant
paint             3  paint,painting,paintings
painting          2  painting,paintings
paintings         1  paintings

In addition to the COUNT(1) aggregate, I also included another expression in the select list. This isn't required, but it does give some additional information about which terms were considered to be matches.

 GROUP_CONCAT(DISTINCT c.terms ORDER BY c.terms) AS `matched_terms`

NOTE: If there's a possibility that terms contains backslash characters, we can escape those characters as well, using another REPLACE

 REPLACE(REPLACE(REPLACE( w.terms ,'\\','\\\\'),'_','\_'),'%','\%')
                 ^^^^^^^^         ^^^^^^^^^^^^^

Upvotes: 3

Related Questions