
count of word occurrences in a list, case insensitive

What is the most professional way to obtain a case-insensitive count of the distinct words contained in an array using plain JavaScript? I have made a first attempt myself, but it does not feel very professional.

I would like the result to be a Map.

Upvotes: 1

Answers (4)

pilchard

Reputation: 12929

While the accepted 'group-by' operation is fine, it doesn't address the complexity of case-insensitive/Unicode comparison.

First of all, you can reduce directly into a Map. The snippet below counts the strings exactly as they are, without accounting for case or Unicode variations, so 20 'distinct' characters are counted from an array of length 24.

const input = [ 'a', 'A', 'b', 'B', '\u00F1', '\u006E\u0303', 'İ', 'i', 'Gesäß',
  'GESÄSS', '\u0399', '\u1FBE', '\u00E5', '\u212B', '\u00C5', '\u212B', '\u0399', '\u1FBE', '\u03B9', '\u1FBE', '\u03B2', '\u03D0', '\u03B5', '\u03F5', ];

const result = input.reduce((a, b) => a.set(b, (a.get(b) ?? 0) + 1), new Map());

console.log('distinct count: ', result.size);
console.log('Map(', result.size, ') {', [...result.entries()].map(([k, v]) => `${k} => ${v}`).join(', '), '}');

Based on the samples below, the method that gives the most compact count for this specific sample array is word.normalize().toLocaleUpperCase() with the Turkish locale ('tr'). It counts 9 'distinct' characters from an array of length 24, properly handling the different encodings of ñ, the equivalent spellings Gesäß and GESÄSS (upper-casing maps ß to SS), and locale-specific case changes (i to İ).

const input = [ 'a', 'A', 'b', 'B', '\u00F1', '\u006E\u0303', 'İ', 'i', 'Gesäß',
  'GESÄSS', '\u0399', '\u1FBE', '\u00E5', '\u212B', '\u00C5', '\u212B', '\u0399', '\u1FBE', '\u03B9', '\u1FBE', '\u03B2', '\u03D0', '\u03B5', '\u03F5', ];

const result_normalize_locale = input.reduce((a, b) => {
  const w = b.normalize().toLocaleUpperCase('tr');

  return a.set(w, (a.get(w) ?? 0) + 1);
}, new Map());

console.log('distinct count: ', result_normalize_locale.size);
console.log('Map(', result_normalize_locale.size, ') {', [...result_normalize_locale.entries()].map(([k, v]) => `${k} => ${v}`).join(', '), '}');
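
If this grouping is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch of one possible shape: countWords and its locale parameter are hypothetical names that do not appear in the snippets above, and the usage example passes 'en-US' purely to keep the output independent of Turkish casing rules.

```javascript
// Hypothetical helper (illustrative name): counts words case-insensitively into a Map.
// `locale` is forwarded to toLocaleUpperCase(); pass 'tr' for Turkish casing rules.
function countWords(words, locale) {
  const counts = new Map();
  for (const word of words) {
    // normalize() defaults to NFC, so composed and decomposed forms share one key
    const key = word.normalize().toLocaleUpperCase(locale);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const sample = [
  'Ges\u00E4\u00DF', // 'Gesäß'
  'GES\u00C4SS',     // 'GESÄSS'
  '\u00F1',          // ñ as a single code point
  '\u006E\u0303',    // ñ as n + combining tilde
  'a',
  'A',
];

const sampleCounts = countWords(sample, 'en-US');
console.log(sampleCounts.size); // 3
console.log([...sampleCounts]);
```

Upper-casing is used for the key here because, as the comparison below suggests, it merges more of the equivalent spellings in this sample than lower-casing does.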

Using this simple 'group-by', we can look at the variations between the available case-conversion methods: toLowerCase(), toLocaleLowerCase(), toUpperCase(), and toLocaleUpperCase(). Unicode variations can be accounted for using normalize().

To lower case

toLowerCase() – 15 'distinct' characters.

toLocaleLowerCase() – 14 'distinct' characters, in this case specifying Turkish ('tr') as the locale.

normalize().toLocaleLowerCase() – 12 'distinct' characters, again with 'tr' as locale.

const input = [ 'a', 'A', 'b', 'B', '\u00F1', '\u006E\u0303', 'İ', 'i', 'Gesäß',
  'GESÄSS', '\u0399', '\u1FBE', '\u00E5', '\u212B', '\u00C5', '\u212B', '\u0399', '\u1FBE', '\u03B9', '\u1FBE', '\u03B2', '\u03D0', '\u03B5', '\u03F5', ];
// ['a', 'A', 'b', 'B', 'ñ', 'ñ', 'İ', 'i', 'Gesäß', 'GESÄSS', 'Ι', 'ι', 'å', 'Å', 'Å', 'Å', 'Ι', 'ι', 'ι', 'ι', 'β', 'ϐ', 'ε', 'ϵ', ]
// input.length: 24

// grouping by toLowerCase()
const result = input.reduce((a, b) => {
  const w = b.toLowerCase();

  return a.set(w, (a.get(w) ?? 0) + 1);
}, new Map());

// grouping by toLocaleLowerCase('tr') [Turkish]
const result_locale = input.reduce((a, b) => {
  const w = b.toLocaleLowerCase('tr');

  return a.set(w, (a.get(w) ?? 0) + 1);
}, new Map());

// grouping by normalize().toLocaleLowerCase('tr') [Turkish]
const result_normalize_locale = input.reduce((a, b) => {
  const w = b.normalize().toLocaleLowerCase('tr');

  return a.set(w, (a.get(w) ?? 0) + 1);
}, new Map());

// log toLowerCase() result - 15 'distinct' characters
console.log('toLowerCase() ');
console.log('distinct count: ', result.size);
console.log('Map(', result.size, ') {', [...result.entries()].map(([k, v]) => `${k} => ${v}`).join(', '), '}');

// log toLocaleLowerCase('tr') result - 14 'distinct' characters
console.log("\ntoLocaleLowerCase('tr')");
console.log('distinct count: ', result_locale.size);
console.log('Map(', result_locale.size, ') {', [...result_locale.entries()].map(([k, v]) => `${k} => ${v}`).join(', '), '}');

// log normalize().toLocaleLowerCase('tr') result - 12 'distinct' characters
console.log("\nnormalize().toLocaleLowerCase('tr')");
console.log('distinct count: ', result_normalize_locale.size);
console.log('Map(', result_normalize_locale.size, ') {', [...result_normalize_locale.entries()].map(([k, v]) => `${k} => ${v}`).join(', '), '}');


To upper case

toUpperCase() – 12 'distinct' characters.

toLocaleUpperCase() – 11 'distinct' characters, in this case specifying Turkish ('tr') as the locale.

normalize().toLocaleUpperCase() – 9 'distinct' characters, again with 'tr' as locale.

const input = [ 'a', 'A', 'b', 'B', '\u00F1', '\u006E\u0303', 'İ', 'i', 'Gesäß',
  'GESÄSS', '\u0399', '\u1FBE', '\u00E5', '\u212B', '\u00C5', '\u212B', '\u0399', '\u1FBE', '\u03B9', '\u1FBE', '\u03B2', '\u03D0', '\u03B5', '\u03F5', ];
// ['a', 'A', 'b', 'B', 'ñ', 'ñ', 'İ', 'i', 'Gesäß', 'GESÄSS', 'Ι', 'ι', 'å', 'Å', 'Å', 'Å', 'Ι', 'ι', 'ι', 'ι', 'β', 'ϐ', 'ε', 'ϵ', ]
// input.length: 24

// grouping by toUpperCase() 
const result = input.reduce((a, b) => {
  const w = b.toUpperCase();

  return a.set(w, (a.get(w) ?? 0) + 1);
}, new Map());

// grouping by toLocaleUpperCase('tr') [Turkish]
const result_locale = input.reduce((a, b) => {
  const w = b.toLocaleUpperCase('tr');

  return a.set(w, (a.get(w) ?? 0) + 1);
}, new Map());

// grouping by normalize().toLocaleUpperCase('tr') [Turkish]
const result_normalize_locale = input.reduce((a, b) => {
  const w = b.normalize().toLocaleUpperCase('tr');

  return a.set(w, (a.get(w) ?? 0) + 1);
}, new Map());

// log toUpperCase() result - 12 'distinct' characters
console.log('toUpperCase() ');
console.log('distinct count: ', result.size);
console.log('Map(', result.size, ') {', [...result.entries()].map(([k, v]) => `${k} => ${v}`).join(', '), '}');

// log toLocaleUpperCase('tr') result - 11 'distinct' characters
console.log("\ntoLocaleUpperCase('tr')");
console.log('distinct count: ', result_locale.size);
console.log('Map(', result_locale.size, ') {', [...result_locale.entries()].map(([k, v]) => `${k} => ${v}`).join(', '), '}');

// log normalize().toLocaleUpperCase('tr') result - 9 'distinct' characters
console.log("\nnormalize().toLocaleUpperCase('tr')");
console.log('distinct count: ', result_normalize_locale.size);
console.log('Map(', result_normalize_locale.size, ') {', [...result_normalize_locale.entries()].map(([k, v]) => `${k} => ${v}`).join(', '), '}');
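
One way to see why the upper-case groupings come out smaller than the lower-case ones for this sample is to compare a few of the entries pairwise. This is a small sketch using only built-in string methods; sameLower and sameUpper are made-up helper names, and the strings are written as escapes so the example does not depend on how the source file is encoded.

```javascript
// Illustrative helpers: do two strings match after lower-casing / upper-casing?
const sameLower = (a, b) => a.toLowerCase() === b.toLowerCase();
const sameUpper = (a, b) => a.toUpperCase() === b.toUpperCase();

// 'Gesäß' vs 'GESÄSS': ß upper-cases to 'SS', but 'SS' lower-cases to 'ss', not 'ß'
console.log(sameLower('Ges\u00E4\u00DF', 'GES\u00C4SS')); // false
console.log(sameUpper('Ges\u00E4\u00DF', 'GES\u00C4SS')); // true

// β (U+03B2) vs ϐ (U+03D0): both are lower-case letters that share the capital Β (U+0392)
console.log(sameLower('\u03B2', '\u03D0')); // false
console.log(sameUpper('\u03B2', '\u03D0')); // true

// Case conversion alone does not merge the two encodings of ñ; normalize() does
console.log(sameUpper('\u00F1', '\u006E\u0303'));                 // false
console.log('\u00F1'.normalize() === '\u006E\u0303'.normalize()); // true
```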


Upvotes: 0

Ran Turner

Reputation: 18116

You can use an object to store the results and then create a Map by passing the result of Object.entries on that object to the Map constructor.

const arr = ["c", "A", "C", "B", "b"];

const counts = {};
for (const el of arr) {
  let c = el.toLowerCase();
  counts[c] = counts[c] ? ++counts[c] : 1;
}

console.log(counts);

const map = new Map(Object.entries(counts));
// Map.prototype.forEach passes (value, key), in that order
map.forEach((value, key) => console.log(key, value));
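
One caveat with a plain object as the intermediate store, relevant when the input is arbitrary words rather than single letters: a truthiness check such as `counts[c] ? ++counts[c] : 1` also sees properties inherited from Object.prototype. The sketch below illustrates this with the word 'constructor' and compares it with counting directly into a Map; the variable names are illustrative only.

```javascript
const words = ['constructor', 'Constructor'];

// Plain object: the inherited `constructor` property is already truthy,
// so the first increment produces NaN and the count restarts at 1.
const objCounts = {};
for (const w of words) {
  const k = w.toLowerCase();
  objCounts[k] = objCounts[k] ? ++objCounts[k] : 1;
}
console.log(objCounts.constructor); // 1 (one occurrence was lost)

// Map: keys are independent of the prototype chain.
const mapCounts = new Map();
for (const w of words) {
  const k = w.toLowerCase();
  mapCounts.set(k, (mapCounts.get(k) ?? 0) + 1);
}
console.log(mapCounts.get('constructor')); // 2
```

Since the question asks for a Map as the result anyway, counting into the Map directly also avoids the conversion step.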

Upvotes: 0

Spectric

Reputation: 31992

You can use Array.prototype.reduce to store each word as a property, with its number of occurrences as the value.

In the reducer function, check whether the word (converted to lowercase) exists as a property. If not, set its value to 1. Otherwise, increment the property value.

const arr = ["a", "A", "b", "B"]

const result = arr.reduce((a,b) => {
  let c = b.toLowerCase();
  return a[c] = a[c] ? ++a[c] : 1, a;
}, {})

console.log(result)

As a one-liner: const result = arr.reduce((a, b) => (a[b.toLowerCase()] = (a[b.toLowerCase()] ?? 0) + 1, a), {})

To convert it to a Map, you can use Object.entries (suggested by @Théophile):

const arr = ["a", "A", "b", "B"]

const result = arr.reduce((a, b) => {
  let c = b.toLowerCase();
  return a[c] = a[c] ? ++a[c] : 1, a;
}, {})

const m = new Map(Object.entries(result))
m.forEach((value, key) => console.log(key, ':', value))

Upvotes: 2

Bryan Dellinger

Reputation: 5304

Use a Set to get rid of duplicates and the spread operator to put the values back into an array. Note that this gives the distinct words, not their counts.

const myarray = ['one', 'One', 'two', 'TWO', 'three'];
const noDupes = [...new Set(myarray.map(x => x.toLowerCase()))];
console.log(noDupes);

Upvotes: -1
