LaLuna Kon

Reputation: 23

Regex function to split words and numbers in a hashtag in a sentence

I need a regex-based function that recognizes hashtags in a sentence, splits each hashtag into its words and numbers, and inserts the word 'hashtag' right after the # sign. For example:

As you can see, the words need to be split before every capital letter and before every number. However, a number such as 2015 must stay whole rather than become 2 0 1 5.

I already have the following:

document = re.sub(r"(#)([A-Za-z]*|\d*)", r" \1hashtag \2 ", document)

With output: #hashtag MainauDeclaration 2015 watch out guys.. This is HUGE!! #hashtag LindauNobel #hashtag SemST.

Upvotes: 2

Views: 67

Answers (1)

Wiktor Stribiżew

Reputation: 627488

You can use

import re
text = "#MainauDeclaration2015 watch out guys.. This is HUGE!! #LindauNobel #SemST"
print( re.sub(r'#(\w+)', lambda x: '#hashtag ' + re.sub(r'(?!^)(?=[A-Z])|(?<=\D)(?=\d)|(?<=\d)(?=\D)', ' ', x.group(1)), text) )
# => #hashtag Mainau Declaration 2015 watch out guys.. This is HUGE!! #hashtag Lindau Nobel #hashtag Sem S T

The #(\w+) regex used with the first re.sub matches a # followed by one or more word characters (letters, digits, or underscores), which are captured into Group 1.
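
As a quick check of what the outer pattern captures, the sketch below runs it with re.findall on the sample text from the code above. The variable name tags is illustrative only.

```python
import re

text = "#MainauDeclaration2015 watch out guys.. This is HUGE!! #LindauNobel #SemST"

# With a single capturing group, findall returns the Group 1 values,
# i.e. each hashtag body without the leading '#'.
tags = re.findall(r'#(\w+)', text)
print(tags)
# => ['MainauDeclaration2015', 'LindauNobel', 'SemST']
```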

The re.sub(r'(?!^)(?=[A-Z])|(?<=\D)(?=\d)|(?<=\d)(?=\D)', ' ', x.group(1)) part takes the Group 1 value as input and inserts a space in three places: between a non-digit and a digit, between a digit and a non-digit, and before any uppercase letter that is not at the start of the string.
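
The three alternatives can also be tried in isolation. The following is a minimal sketch; split_tag and BOUNDARY are hypothetical names chosen for illustration, not part of the re module.

```python
import re

# Zero-width boundaries: before a non-initial uppercase letter,
# between a non-digit and a digit, and between a digit and a non-digit.
BOUNDARY = re.compile(r'(?!^)(?=[A-Z])|(?<=\D)(?=\d)|(?<=\d)(?=\D)')

def split_tag(tag):
    """Insert a space at every boundary inside a hashtag body (hypothetical helper)."""
    return BOUNDARY.sub(' ', tag)

print(split_tag("MainauDeclaration2015"))  # => Mainau Declaration 2015
print(split_tag("SemST"))                  # => Sem S T
print(split_tag("abc123def"))              # => abc 123 def
```

Because every non-initial uppercase letter counts as a boundary, runs of capitals such as ST are split letter by letter, while digit runs such as 2015 stay intact since no boundary lies between two digits.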

Upvotes: 0
