user3142695

Reputation: 17332

Get initials and full last name from a string containing names

Assume there are some strings containing names in different format (each line is a possible user input):

'Guilcher, G.M., Harvey, M. & Hand, J.P.'
'Ri Liesner, Peter Tom Collins, Michael Richards'
'Manco-Johnson M, Santagostino E, Ljung R.'

I need to transform those names into the format Lastname ABC. So each first name should be reduced to its initial, and the initials are appended to the last name.

The example should result in

Guilcher GM, Harvey M, Hand JP
Liesner R, Collins PT, Richards M
Manco-Johnson M, Santagostino E, Ljung R

The problem is that the input can come in several different formats. I don't think my attempts are very smart, so I'm asking for

  1. Some hints to optimize the transformation code
  2. How do I put all of this in a single function? I assume I first have to test which format the string has?

Here is how far I got:

First example string

In the first example the initials are each followed by a dot. The dots should be removed, and the comma between the name and its initials should be removed as well.

firstString
  .replace(/\./g, '')        // a string pattern would only replace the first dot
  .replace(/\s*&\s*/g, ', ') // treat "&" as another separator

I think I need a regex to remove the comma between the name and the initials.
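
One way that regex could look, as a rough sketch only: a lookahead removes a comma only when the token after it consists purely of capital letters. The helper name normalizeFirstFormat is made up for illustration, and the sketch assumes that surnames are never written in all caps.

```javascript
// Hypothetical helper for the first input format; not a complete solution.
// Assumption: a token made only of capital letters is a block of initials.
const normalizeFirstFormat = (line) =>
  line
    .replace(/\./g, '')         // "G.M." -> "GM"
    .replace(/\s*&\s*/g, ', ')  // treat "&" like a comma
    // Drop the comma only if an all-caps token follows, up to the next comma or the end
    .replace(/,\s*(?=[A-Z]+(?:,|$))/g, ' ');

console.log(normalizeFirstFormat('Guilcher, G.M., Harvey, M. & Hand, J.P.'));
// -> Guilcher GM, Harvey M, Hand JP
```

The lookahead does not consume the initials, so the comma that follows them stays in place as the separator between two people.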

Second example string

In the second example each name should be split by spaces, and the last element is treated as the last name:

const elm = secondString.trim().split(/\s+/)
const lastname = elm[elm.length - 1]
// every word except the last one contributes its upper-cased first letter
const initials = elm.slice(0, -1).map(n => n[0].toUpperCase())

return lastname + ' ' + initials.join('')

...not very elegant

Third example string

The third example already has the correct format; only the dot at the end has to be removed. Nothing else has to be done with that input.

Upvotes: 3

Views: 4141

Answers (3)

Jorjon

Reputation: 5434

Here's my approach. I tried to keep it short, but covering the edge cases made it surprisingly complex.

  • First I format the input, replacing & with , and removing every dot.
  • Then I split the input by \n, then by , and finally by spaces.
  • Next I process the chunks. On each new segment (delimited by ,), I process the previous segment. I do this because I need to know whether the current segment is only an initial. If it is, it belongs to the previous segment, so I skip the initial-only segment and process the previous one, which by then has both the surname and the correct initial.
  • I extract the initial from the segment if there is one. It is used at the start of the next segment to process the current one.
  • After finishing each line, I process the last segment once more, as it wouldn't be handled otherwise.

I realize the complexity is high without regular expressions; a state machine would probably have been a better way to parse the input.

const isInitial = s => [...s].every(c => c === c.toUpperCase());
const generateInitial = arr => arr.reduce((a, c, i) => a + (i < arr.length - 1 ? c[0].toUpperCase() : ''), '');
const formatSegment = (words, initial) => {
  if (!initial) {
    initial = generateInitial(words);
  }
  const surname = words[words.length - 1];
  return {initial, surname};
}

const doDisplay = x => x.map(x => x.surname + ' ' + x.initial).join(', ');

const doProcess = _ => {
  const formatted = input.value.replace(/\./g, '').replace(/&/g, ',');
  const chunks = formatted.split('\n').map(x => x.split(',').map(x => x.trim().split(' ')));
  const peoples = [];
  chunks.forEach(line => {
    let lastSegment = null;
    let lastInitial = null;
    let lastInitialOnly = false;
    line.forEach(segment => {
      if (lastSegment) {
        // if segment only contains an initial, it's the initial corresponding
        // to the previous segment
        const initialOnly = segment.length === 1 && isInitial(segment[0]);
        if (initialOnly) {
          lastInitial = segment[0];
        }
        // avoid processing last segments that were only initials
        // this prevents adding a segment twice
        if (!lastInitialOnly) {
          // if segment isn't an initial, we need to generate an initial
          // for the previous segment, if it doesn't already have one
          const people = formatSegment(lastSegment, lastInitial);
          peoples.push(people);
        }
        lastInitialOnly = initialOnly;
        
        // Skip initial only segments
        if (initialOnly) {
          return;
        }
      }
      lastInitial = null;
      
      // Remove the initial from the words
      // to avoid getting the initial calculated for the initial
      segment = segment.filter(word => {
        if (isInitial(word)) {
          lastInitial = word;
          return false;
        }
        return true;
      });
      lastSegment = segment;
    });
    
    // Process last segment
    if (!lastInitialOnly) {
      const people = formatSegment(lastSegment, lastInitial);
      peoples.push(people);
    }
  });
  return peoples;
}
process.addEventListener('click', _ => {
  const peoples = doProcess();
  const display = doDisplay(peoples);
  output.value = display;
});

.row {
  display: flex;
}

.row > * {
  flex: 1 0;
}

<div class="row">
  <h3>Input</h3>
  <h3>Output</h3>
</div>
<div class="row">
  <textarea id="input" rows="10">Guilcher, G.M., Harvey, M. & Hand, J.P.
Ri Liesner, Peter Tom Collins, Michael Richards
Manco-Johnson M, Santagostino E, Ljung R.
Jordan M, Michael Jackson & Willis B.</textarea>
  <textarea id="output" rows="10"></textarea>
</div>
<button id="process" style="display: block;">Process</button>
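
For comparison, here is a minimal sketch of a small stateful pass, loosely in the spirit of the state machine mentioned above. It is not the code from this answer. The names looksLikeInitials and formatNames are invented for the sketch, and it assumes that a token of one to three capital letters is always a block of initials, so an all-caps surname such as "LEE" would be misread.

```javascript
// Hypothetical sketch: one pass over the comma/ampersand separated segments.
// Assumption: 1-3 capital letters with nothing else means "initials".
const looksLikeInitials = (word) => /^[A-Z]{1,3}$/.test(word);

function formatNames(line) {
  const segments = line
    .replace(/\./g, '')     // drop dots: "J.P." -> "JP"
    .split(/\s*[,&]\s*/)    // "," and "&" both separate segments
    .map((s) => s.trim())
    .filter(Boolean);

  const people = [];
  let pending = null; // the only state: a surname still waiting for its initials

  for (const segment of segments) {
    const words = segment.split(/\s+/);
    const last = words[words.length - 1];

    // "Guilcher" followed by "GM": attach the initials to the waiting surname
    if (pending !== null && words.length === 1 && looksLikeInitials(last)) {
      people.push(`${pending} ${last}`);
      pending = null;
      continue;
    }
    // A waiting surname that never got initials is emitted as it is
    if (pending !== null) {
      people.push(pending);
      pending = null;
    }
    if (words.length === 1) {
      pending = last;                                  // surname only, wait for the next segment
    } else if (looksLikeInitials(last)) {
      people.push(`${words.slice(0, -1).join(' ')} ${last}`); // "Ljung R" is already correct
    } else {
      const initials = words.slice(0, -1).map((w) => w[0].toUpperCase()).join('');
      people.push(`${last} ${initials}`);              // "Peter Tom Collins" -> "Collins PT"
    }
  }
  if (pending !== null) people.push(pending);
  return people.join(', ');
}

console.log(formatNames('Guilcher, G.M., Harvey, M. & Hand, J.P.'));
// -> Guilcher GM, Harvey M, Hand JP
console.log(formatNames('Ri Liesner, Peter Tom Collins, Michael Richards'));
// -> Liesner R, Collins PT, Richards M
```

The function handles one line at a time, so multi-line input would be split on \n first and each line passed through formatNames.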

Upvotes: 1

revo

Reputation: 48711

It wouldn't be possible without chaining several replace() calls. The steps in the solution below are as follows:

  • Remove all dots in abbreviated names
  • Swap first names and last name
  • Replace first names with their first letter
  • Remove unwanted characters

Demo:

var s = `Guilcher, G.M., Harvey, M. & Hand, J.P.
Ri Liesner, Peter Tom Collins, Michael Richards
Manco-Johnson M, Santagostino E, Ljung R.`

// Remove all dots in abbreviated names
var b = s.replace(/\b([A-Z])\./g, '$1')
// Swap first names and last name ($1 = first names, $2 = last name)
.replace(/([A-Z][\w-]+(?: +[A-Z][\w-]+)*) +([A-Z][\w-]+)\b/g, ($0, $1, $2) => {
    // Replace full first names with their first letter
    return $2 + " " + $1.replace(/\b([A-Z])\w+ */g, '$1');
})
// Move the comma behind the initials and drop a following comma / ampersand
.replace(/(,) +([A-Z]+)\b *[,&]?/g, ' $2$1')
// Drop the comma that the previous step leaves at the end of a line
.replace(/,[ \t]*$/gm, '');

console.log(b);

Upvotes: 3

wiesion

Reputation: 2445

Given your example data, I would try to make guesses based on a name part count of 2, since it is very hard to rely on any of , & or \n as a separator, which means treating them all as ,.

Try this against your data and let me know of any cases where it fails, because I am highly confident that this script will fail at some point with more data :)

let testString = "Guilcher, G.M., Harvey, M. & Hand, J.P.\nRi Liesner, Peter Tom Collins, Michael Richards\nManco-Johnson M, Santagostino E, Ljung R.";

const inputToArray = i => i
    .replace(/\./g, "")
    .replace(/[\n&]/g, ",")
    .replace(/ ?, ?/g, ",")
    .split(',');

const reducer = function(accumulator, value, index, array) {
    let pos = accumulator.length - 1;
    let names = value.split(' ');
    if(names.length > 1) {
        accumulator.push(names);
    } else {
        if(accumulator[pos].length > 1) accumulator[++pos] = [];
        accumulator[pos].push(value);
    }
    return accumulator.filter(n => n.length > 0);
};

console.log(inputToArray(testString).reduce(reducer, [[]]));
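
The reducer above only groups the name parts; it does not produce the Lastname ABC strings yet. A possible last step is sketched below. The names isInitialsPart and formatGroup are invented for this sketch, and the sample groups are written out by hand in the shape the reducer is expected to return. It assumes that an all-caps final part is a block of initials.

```javascript
// Hypothetical helper: turn one group of name parts into "Lastname ABC".
// Assumption: a final part made only of capital letters is already the initials.
const isInitialsPart = (part) => /^[A-Z]+$/.test(part);

const formatGroup = (parts) => {
  const last = parts[parts.length - 1];
  if (parts.length > 1 && isInitialsPart(last)) {
    // ["Guilcher", "GM"]: the surname comes first, initials are already there
    return `${parts.slice(0, -1).join(' ')} ${last}`;
  }
  // ["Peter", "Tom", "Collins"]: the surname is last, build the initials from the rest
  const initials = parts.slice(0, -1).map((p) => p[0].toUpperCase()).join('');
  return initials ? `${last} ${initials}` : last;
};

// Sample groups in the assumed output shape of the reducer
const groups = [
  ['Guilcher', 'GM'],
  ['Ri', 'Liesner'],
  ['Peter', 'Tom', 'Collins'],
  ['Manco-Johnson', 'M'],
];

console.log(groups.map(formatGroup).join(', '));
// -> Guilcher GM, Liesner R, Collins PT, Manco-Johnson M
```

Because line breaks are treated as commas earlier on, this yields one flat list; keeping the original lines apart would require splitting on \n before grouping.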

Upvotes: 1
