codeBarer

Reputation: 2378

Extracting hashtags out of a string.

Suppose I have a string such as:

var comment = "Mmmm #yummy #donut at #CZ";

How can I get a list of hash tags that exist in the string variable?

I tried using JavaScript's split() method, but then I have to keep splitting each of the strings produced by the initial split. Is there a simpler way of doing this?

Upvotes: 13

Views: 13668

Answers (7)

Chanchala Gorale

Reputation: 151

content.split(/\s+/).filter(tag => tag.startsWith('#'))

Upvotes: 0

Nikolay Schamberg

Reputation: 324

If you need a character of any alphabet within hashtag, I'd go with something like this:

let text = "улетные #выходные // #holiday in the countryside";
const hashtags = [];
// preHashtags[0] is never a hashtag: it holds the text before the first '#',
// or an empty string when the text starts with '#'. So the loop starts at 1.
const preHashtags = text.split('#');
for (let i = 1; i < preHashtags.length; i++) {
    // String.prototype.split() is needed to sort out non-hashtag related string data
    const tag = preHashtags[i].split(' ')[0];
    if (tag) hashtags.push(tag); // skip empty results such as "# " or "##"
}

console.log(hashtags) // outputs [ 'выходные', 'holiday' ]

The loop starts at index 1 because the first element of the preHashtags array is never a hashtag. If the text does not start with '#', that element is the ordinary text before the first hashtag; if it does, the element is an empty string. Either way it can be skipped.

Take note that you may need to validate the resulting hashtags array.

The benefit of this approach is that it accepts any possible symbol (hence the need for sanity checks), including all non-Latin alphabets, and that it is simpler to understand, especially for beginners. Performance is mixed: it is faster in Chrome (and thus probably in other Chromium-derived browsers, as well as Node.js), but 6-7% slower in Firefox and 13% slower in Safari, judging by this test: https://jsben.ch/VuhEi.

Thus, the choice depends on whether your code will run in Node.js or in the browser and, in the latter case, on whether many of your clients use Mobile Safari.

Upvotes: 0

Maxim Zubarev

Reputation: 2473

Here is another very simple regex. It allows emojis and numbers in hashtags and does not require white space between hashtags to split them. Most of the time this should be more than sufficient:

"Mmmm #yummy #donut at #CZ#efrefg #:) #cool😎#r234#FEGERGR#fegergr".match(/#[^\s#]*/gmi);
// => ["#yummy", "#donut", "#CZ", "#efrefg", "#:)", "#cool😎", "#r234", "#FEGERGR", "#fegergr"]

There is a small downside, though: this regex keeps trailing punctuation as part of the hashtag, e.g.:

"Mmmm #yummy.#donut#cool😎#r234#FEGERGR;#fegergr".match(/#[^\s#]*/gmi);
// => ["#yummy.", "#donut", "#cool😎", "#r234", "#FEGERGR;", "#fegergr"]

You can extend the character class with any punctuation characters you want to exclude, like this:

"Mmmm #yummy.#donut#cool😎#r234#FEGERGR;#fegergr".match(/#[^\s#\.\;]*/gmi);
// => ["#yummy", "#donut", "#cool😎", "#r234", "#FEGERGR", "#fegergr"]
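As an alternative to listing excluded characters inside the regex, the punctuation can be trimmed in a second step. The following is only a sketch: the helper name extractLooseHashtags and the list of trailing characters (. , ; : ! ?) are assumptions made for illustration, not part of the answer above.

```javascript
// Hypothetical helper: permissive match first, then trim trailing punctuation.
function extractLooseHashtags(text) {
  // + instead of * so that a bare "#" is not matched; match() returns null when nothing matches.
  const raw = text.match(/#[^\s#]+/g) || [];
  return raw
    .map((tag) => tag.replace(/[.,;:!?]+$/, "")) // assumed set of trailing punctuation
    .filter((tag) => tag.length > 1);            // drop tags that were reduced to "#"
}

console.log(extractLooseHashtags("Mmmm #yummy.#donut#cool😎#r234#FEGERGR;#fegergr"));
// ["#yummy", "#donut", "#cool😎", "#r234", "#FEGERGR", "#fegergr"]
```

Trimming only at the end of each tag keeps hashtags such as "#:)" intact, which a wider exclusion list inside the character class would cut short.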

Upvotes: 2

randomor

Reputation: 5663

Do you care about Unicode or non-English hashtags?

"Mmmm #yummy #donut at #CZ #中文 #.dou #。#?#♥️ #にほ".match(/#[\p{L}]+/ugi)
// => ["#yummy", "#donut", "#CZ", "#中文", "#にほ"]

As explained by this answer: https://stackoverflow.com/a/35112226/515585

\p{L} matches any Unicode letter, in any script.

u enables Unicode mode in JavaScript regular expressions; \p{...} property escapes are only recognized when this flag is set.
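Since \p{L} covers letters only, a tag such as #r234 would be cut off at the first digit. A possible extension, shown here as a sketch, adds \p{N} (any Unicode number) and the underscore, and uses a capture group to return the tags without the leading '#'. The helper name extractUnicodeHashtags is made up for this example.

```javascript
// Hypothetical helper: hashtags made of Unicode letters, digits, or underscores.
function extractUnicodeHashtags(text) {
  // \p{L} = any letter, \p{N} = any number; both require the u flag.
  const pattern = /#([\p{L}\p{N}_]+)/gu;
  // matchAll yields one match per hashtag; group 1 is the tag without '#'.
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

console.log(extractUnicodeHashtags("Mmmm #yummy #中文 #にほ #r234 #."));
// ["yummy", "中文", "にほ", "r234"]
```

matchAll requires the g flag on the regex and returns an empty iterator when nothing matches, so no null check is needed.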

Upvotes: 13

Sagiv Ofek

Reputation: 25270

If you care about readability:

yourText.split(' ').filter(v=> v.startsWith('#'))

will return ["#yummy", "#donut", "#CZ"]

Upvotes: 6

Peter Olson

Reputation: 142921

Just use a regular expression to find occurrences of a hash followed by word characters (\w matches ASCII letters, digits, and underscores).

"Mmmm #yummy #donut at #CZ".match(/#\w+/g)
// evaluates to ["#yummy", "#donut", "#CZ"]
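One thing to watch for: with the g flag, match() returns null rather than an empty array when the string contains no hashtags. A minimal sketch of a wrapper that always returns an array (the function name extractHashtags is hypothetical):

```javascript
// Hypothetical helper: returns every "#word" token, or [] when there is none.
function extractHashtags(text) {
  // match() with the g flag returns null (not []) when nothing matches.
  return text.match(/#\w+/g) || [];
}

console.log(extractHashtags("Mmmm #yummy #donut at #CZ")); // ["#yummy", "#donut", "#CZ"]
console.log(extractHashtags("no tags here"));              // []
```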

Upvotes: 18

Keir Simmons

Reputation: 1684

This matches hashtags made of ASCII letters; you can extend the regexp to cover other characters if you want:

myString.match(/#[a-z]+/gi);

Upvotes: 17
