krasnapolsky

Reputation: 357

regex split names with special characters (dash, apostrophe)

I have a column of names in which the first and last names are concatenated (that is, there is no space between them). I am trying to split them apart, which has already been asked on this website. Here, however, some names contain dashes (-) or apostrophes (').

Speed-WagonMario
CruiserPetey
SthesiaAnna
De’wayneJohn

I want to make sure these names are caught by my regex:

clean_names = re.split(r'([A-Z][a-z\']+\-[A-Z][a-z\']+|[A-Z][a-z\']+)', names)

It works for dashes, which only occur before an uppercase letter, but not for apostrophes.

Does anyone have a suggestion for how to fix my regex? Thanks in advance.

Upvotes: 0

Views: 254

Answers (1)

Mr. Polywhirl

Reputation: 48640

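You can combine a positive lookbehind (lower-case letter) with a positive lookahead (upper-case letter). Lookarounds are zero-width, so the split consumes no characters and the letters on both sides of the boundary are kept. In Python, splitting on an empty match with re.split requires version 3.7 or later.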

/           // BEGIN EXPRESSION
(?<=[a-z])  // POSITIVE LOOKBEHIND [a-z]
(?=[A-Z])   // POSITIVE LOOKAHEAD  [A-Z]
/           // END EXPRESSION
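
In Python, the same annotated pattern could be written with re.VERBOSE, which allows whitespace and # comments inside the pattern. This is only a small sketch; BOUNDARY is an illustrative name, not something from the question.

```python
import re

# BOUNDARY is a hypothetical name chosen for this sketch.
BOUNDARY = re.compile(
    r"""
    (?<=[a-z])   # positive lookbehind: previous character is a lower-case letter
    (?=[A-Z])    # positive lookahead: next character is an upper-case letter
    """,
    re.VERBOSE,
)

# The match is zero-width, so no characters are lost (needs Python 3.7+).
print(BOUNDARY.split("Speed-WagonMario"))
```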

Python Example

#!/usr/bin/env python3

import re

def pair_to_person(pair):
  # Each input is "<LastName><FirstName>", so the split yields [last, first].
  return {'firstName': pair[1], 'lastName': pair[0]}

def parse_name_column(column_text):
  # Split every line at the zero-width boundary between a lower-case
  # letter and the upper-case letter that follows it.
  return [pair_to_person(re.split(r'(?<=[a-z])(?=[A-Z])', name))
          for name in column_text.strip().split('\n')]

def print_list(items):
  # Print one item per line.
  print('\n'.join(map(str, items)))

if __name__ == '__main__':
  column_text = '''
Speed-WagonMario
CruiserPetey
SthesiaAnna
De’wayneJohn
'''

  names = parse_name_column(column_text)

  print_list(names)

Output

{'firstName': 'Mario', 'lastName': 'Speed-Wagon'}
{'firstName': 'Petey', 'lastName': 'Cruiser'}
{'firstName': 'Anna', 'lastName': 'Sthesia'}
{'firstName': 'John', 'lastName': 'De’wayne'}
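
As for why the original pattern misses the apostrophe: the sample name De’wayneJohn contains a typographic apostrophe (’, U+2019), while the character class [a-z\'] only lists the ASCII apostrophe ('). If you would rather keep a match-based approach, something like the following might work. It is a minimal sketch: NAME_PART and split_concatenated are hypothetical names, and the character class is an assumption about which characters appear in your data.

```python
import re

# Hypothetical pattern for this sketch: a capitalised word that may contain
# ASCII (') or typographic (\u2019) apostrophes, optionally followed by
# further capitalised words joined with a dash.
NAME_PART = re.compile(r"[A-Z][a-z'\u2019]+(?:-[A-Z][a-z'\u2019]+)*")

def split_concatenated(name):
    """Return the capitalised parts found in a concatenated name."""
    return NAME_PART.findall(name)

for raw in ["Speed-WagonMario", "CruiserPetey", "SthesiaAnna", "De\u2019wayneJohn"]:
    print(split_concatenated(raw))
```

Unlike the lookaround split, this version silently drops any character the class does not list, so the class would need extending for other punctuation or accented letters.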

JS Example

const data = `
Speed-WagonMario
CruiserPetey
SthesiaAnna
De’wayneJohn
`;

const names = data.trim().split('\n')
  .map(name => name.trim().split(/(?<=[a-z])(?=[A-Z])/))
  .map(pair => ({ firstName: pair[1], lastName: pair[0] }));

console.log(names);

Upvotes: 2
