Reputation: 25
I have a Google spreadsheet with two columns: A and C. Column A has over a thousand rows of names/subjects in their own cells, and Column C has several hundred rows with a variable combination of the above names/subjects in its various cells, with each name/subject separated from the others by a semicolon. Each of these columns gains new entries frequently throughout the course of the day.
As a very simplified visual example of the setup:
**Column A: Names | Column B: Occurrences | Column C: List**
A2: Adam | B2: [Blank] | C2: Charles; Adam
A3: Bob | B3: [Blank] | C3: Adam
A4: Charles | B4: [Blank] | C4: Smith, Charles
A5: Smith, Charles | B5: [Blank] | C5: Bob Evans
A6: Bob Evans | B6: [Blank] | C6: Smith, Charles; Charles; Bob
A7: [etc.] | B7: [Blank] | C7: Bob Evans; Charles; Bob
A8: [etc.] | B8: [Blank] | C8: [etc.]
Currently, I’m using the following formula to count the number of times that each string from Column A (here, A2) appears as a substring in Column C (here, C2 through C7):
=ARRAYFORMULA(IF(A2="","",(SUMPRODUCT(REGEXMATCH(REGEXREPLACE(REGEXREPLACE($C$2:$C$7,"([\(\)\?])", ""),"(\w+),{0,1}\s+(\w+)","$1$2"),".*(^|\s)"& trim(REGEXREPLACE(REGEXREPLACE($A2,"([\(\)\?])", ""),"(\w+),{0,1}\s+(\w+)","$1$2"))&"(;|$).*")))))
This gives the correct totals, but it seems to be incredibly processing heavy once you scale up; changing or adding any one entry to Column C causes the entire sheet to recalculate its thousands of entries, and it takes several minutes for it to turn out the new totals. Many of the REGEXREPLACE values are used here because some of the entries have punctuation like “()” and “?”, on account of cells like “Erōs”, “Olympic Games (23rd : 1984 : Los Angeles, Calif.)”, and “Thomas, Aquinas, Saint, 1225?-1274”.
The closest counting alternative that I’ve come up with is the following:
=SUMPRODUCT((LEN(C$2:C$7)-LEN(SUBSTITUTE(C$2:C$7,A2,"")))/LEN(A2))
Testing shows that this much simpler formula can recalculate the whole sheet in a few seconds, but it doesn’t actually count the entries correctly. In the above example of C2-C7, it would give totals of 4 and 5 for “Bob” and “Charles” because it doesn’t distinguish between “Bob” and “Bob Evans” or “Charles” and “Smith, Charles.” It should properly find 2 and 3 respectively.
Is there an effective way to adjust the above formula or create a new one which will count all the substrings properly—limiting the sums to exact matches to Column A as they're found between semicolons in Column C—without causing the sheet’s calculations to freeze up for several minutes at a time? Regular expressions were the route I first ended up taking, but I expect that those operations are the reason that it takes so long.
Upvotes: 0
Views: 180
Reputation: 50644
Try this too:
Somewhere in
Y1:
=QUERY(ARRAYFORMULA(TRIM(TRANSPOSE(SPLIT(CONCATENATE(SPLIT(C2:C6,";")& "🎾"), "🎾")))), "select Col1,Count (Col1) group by Col1")
Upvotes: 1
Reputation: 83
This might get you started in the right direction:
=QUERY(C$2:C,"SELECT count(C) WHERE C CONTAINS ('"&A2&"') OR C CONTAINS upper('"&A2&"') OR C CONTAINS lower('"&A2&"')", -1)
Upvotes: 1