Reputation: 71

Selecting thousands separator character with RegEx

I need to change the decimal separator in a given string that has numbers in it.

What RegEx code can ONLY select the thousands separator character in the string?

It needs to select the separator only when there are digits around it. For example, only in 123,456 do I need to select and replace the ,

I'm converting English numbers into Persian (e.g. Hello 123 becomes Hello ۱۲۳). Now I need to replace the decimal separator with the Persian version too, but I don't know how to select it with regex. E.g. Hello 121,534 must become Hello ۱۲۱/۵۳۴

The character that needs to be replaced is , with /

Upvotes: 3

Answers (4)

hakre

Reputation: 198118

According to your question, the main problem you face is converting English numbers into Persian ones.

PHP has a library that can format and parse numbers according to a locale: the NumberFormatter class (part of the intl extension), which uses the Unicode Common Locale Data Repository (CLDR) to handle, in the end, all languages known to the world.

So converting the number 123,456 from en_GB (or en_US) to fa_IR is shown in this little example:

$string = '123,456';
$float = (new NumberFormatter('en_GB', NumberFormatter::DECIMAL))->parse($string);
var_dump(
    (new NumberFormatter('fa_IR', NumberFormatter::DECIMAL))->format($float)
);

Output:

string(14) "۱۲۳٬۴۵۶"

(play with it on 3v4l.org)

Now this shows (roughly) how to convert the number. I'm not familiar with Persian, so please excuse me if I used the wrong locale here. There may also be options to set which character is used for grouping, but for the moment the example only shows that existing libraries take care of converting the numbers. You don't need to re-invent this. In fact "re-invent" is the wrong word: locale data for every language is not something a single person could build, or at least it would be rather insane to try it alone.

So with the number conversion clarified, the question remains how to apply it to the whole text. Why not locate all the places that potentially hold a number, try to parse each match, and convert it to the other locale if (and only if) parsing succeeds?

Luckily the NumberFormatter::parse() method returns false if parsing fails (more detailed error reporting is available if you are interested), so this is workable.

For the regular expression matching, all that is needed is a pattern that matches a number (the longest match wins); the replacement can be done in a callback. In the following example the translation is deliberately verbose so that the actual parsing and formatting are more visible:

# some text
$buffer = <<<TEXT
it need to only select , when there is number around it. for example only 
when 123,456 i need to select and replace "," I'm converting English
numbers into Persian (e.g: "Hello 123" becomes "Hello ۱۲۳"). now I need to
replace the Decimal separator with Persian version too. but I don't know how
I can select it with regex. e.g: "Hello 121,534" most become 
"Hello ۱۲۱/۵۳۴" The character that needs to be replaced is , with /
TEXT;

# prepare formatters
$inFormat = new NumberFormatter('en_GB', NumberFormatter::DECIMAL);
$outFormat = new NumberFormatter('fa_IR', NumberFormatter::DECIMAL);

$bufferWithFarsiNumbers = preg_replace_callback(
    '(\b[1-9]\d{0,2}(?:[ ,.]\d{3})*\b)u',
    function (array $matches) use ($inFormat, $outFormat) {
        [$number] = $matches;

        $result = $inFormat->parse($number);
        if (false === $result) {
            return $number;
        }

        return sprintf("< %s (%.4f) = %s >", $number, $result, $outFormat->format($result));
    },
    $buffer
);

echo $bufferWithFarsiNumbers;

Output:

it need to only select , when there is number around it. for example only 
when < 123,456 (123456.0000) = ۱۲۳٬۴۵۶ > i need to select and replace "," I'm converting English
numbers into Persian (e.g: "Hello < 123 (123.0000) = ۱۲۳ >" becomes "Hello ۱۲۳"). now I need to
replace the Decimal separator with Persian version too. but I don't know how
I can select it with regex. e.g: "Hello < 121,534 (121534.0000) = ۱۲۱٬۵۳۴ >" most become 
"Hello ۱۲۱/۵۳۴" The character that needs to be replaced is , with /

The only magic here is bringing the pieces together: preg_replace_callback feeds each regex match into the number conversion. The pattern should match the needs in your question and is relatively easy to refine, because you define the whole number part yourself and false positives are filtered out by the NumberFormatter class:

                    pattern for Unicode UTF-8 strings
                                 |
(\b[1-9]\d{0,2}(?:[ ,.]\d{3})*\b)u
  |                 |          |
  |        grouping character  |
  |                            |
word boundary -----------------+

(play with it on regex101.com)

Edit:

To allow only the same grouping character across multiple thousands blocks, capture it in a named group and refer back to it with a back-reference (\k<grouping_char>) for the repetition:

(\b[1-9]\d{0,2}(?:(?<grouping_char>[ ,.])\d{3}(?:\k<grouping_char>\d{3})*)?\b)u

(now this gets less easy to read; get it deciphered and play with it on regex101.com)
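To make the "same grouping character" requirement concrete, here is a small sketch comparing two ways of repeating the named group. In PCRE, \k<grouping_char> is a back-reference that must match the exact character captured earlier, whereas (?&grouping_char) is a subroutine call that re-runs [ ,.] and so accepts a different character on each repetition. The helper firstMatch() and the sample inputs are made up for illustration.

```php
<?php
// Back-reference: later separators must equal the first one captured.
$backref    = '(\b[1-9]\d{0,2}(?:(?<grouping_char>[ ,.])\d{3}(?:\k<grouping_char>\d{3})*)?\b)u';
// Subroutine call: the sub-pattern [ ,.] is simply executed again.
$subroutine = '(\b[1-9]\d{0,2}(?:(?<grouping_char>[ ,.])\d{3}(?:(?&grouping_char)\d{3})*)?\b)u';

// Hypothetical helper: returns the first full match, or null if none.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

foreach (['1,234,567', '1,234.567'] as $candidate) {
    printf(
        "%-10s backref: %-10s subroutine: %s\n",
        $candidate,
        firstMatch($backref, $candidate),
        firstMatch($subroutine, $candidate)
    );
}
```

With mixed separators such as 1,234.567, the back-reference version stops after 1,234, while the subroutine version swallows the whole string.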

To finalize the answer, only the return clause needs to be condensed to return $outFormat->format($result);. The $outFormat NumberFormatter might need some more configuration, but since it is passed into the closure, that can be done when it is created.

(play with it on 3v4l.org)
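As a rough sketch of what that condensed version might look like, assuming the intl extension is loaded: the grouping symbol of the output formatter is overridden with NumberFormatter::setSymbol() so that "/" is used, as the question asks. The function name toPersianNumbers() is invented for this example, and whether "/" is the right symbol for Persian text is an assumption taken from the question.

```php
<?php
// Requires ext-intl. Locales as in the example above (en_US for input).
$inFormat  = new NumberFormatter('en_US', NumberFormatter::DECIMAL);
$outFormat = new NumberFormatter('fa_IR', NumberFormatter::DECIMAL);

// Assumption from the question: group thousands with "/" in the output.
$outFormat->setSymbol(NumberFormatter::GROUPING_SEPARATOR_SYMBOL, '/');

// Hypothetical wrapper around the preg_replace_callback call.
function toPersianNumbers(string $text, NumberFormatter $in, NumberFormatter $out): string
{
    return preg_replace_callback(
        '(\b[1-9]\d{0,2}(?:[ ,.]\d{3})*\b)u',
        function (array $matches) use ($in, $out) {
            $result = $in->parse($matches[0]);

            // Leave the text untouched when the match is not a number.
            return false === $result ? $matches[0] : $out->format($result);
        },
        $text
    );
}

$converted = toPersianNumbers('Hello 121,534', $inFormat, $outFormat);
echo $converted, "\n";
```

Configuring the formatter once, outside the callback, keeps the closure itself down to parse, check, format.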

I hope this is helpful and opens up a broader picture: don't look for a solution only at the spot where you hit the wall. Regex alone is most often not the answer. I'm pretty sure there are regex experts who can give you a fairly stable one-liner, but the context in which it is used will not be very stable. That is not to say there is only one answer. Rather, combining different layers of work (divide and conquer) lets you rely on a stable number conversion even while you are still unsure how to write a regex pattern for an English number.

Upvotes: 4

Barmar

Reputation: 781741

Use a regular expression with lookarounds.

$new_string = preg_replace('/(?<=\d),(?=\d)/', '/', $string);

(?<=\d) means there has to be a digit before the comma, (?=\d) means there has to be a digit after it. But since these are lookarounds, they're not included in the match, so they don't get replaced.
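A short illustration of that behaviour on a string that also contains ordinary commas (the sample strings are made up). Note that the pattern only checks for one digit on each side, so a plain list such as 1,2,3 is rewritten as well.

```php
<?php
$string = 'Hello, world, 121,534 and 1,234,567';

// Only commas that sit directly between two digits are replaced.
$new_string = preg_replace('/(?<=\d),(?=\d)/', '/', $string);
echo $new_string, "\n"; // Hello, world, 121/534 and 1/234/567

// Caveat: a comma-separated list of single digits matches too.
$list = preg_replace('/(?<=\d),(?=\d)/', '/', 'items 1,2,3');
echo $list, "\n"; // items 1/2/3
```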

Upvotes: 4

Joffrey Schmitz

Reputation: 2438

You can write a regex that captures a number with a thousands separator, then join the two numeric parts with the separator you want:

$text = "Hello, world, 121,534";
$pattern = "/([0-9]{1,3}),([0-9]{3})/";
$new_text = preg_replace($pattern, "$1X$2", $text); // replace the comma with 'X', keep the captured groups intact

echo $new_text; // Hello, world, 121X534
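One thing to watch for with this capture-group pattern: the middle digits of a number with several separators are consumed by the first match, so the second comma is not replaced. The sketch below shows this and one possible variant that uses lookarounds so that no digits are consumed. The variant and the sample string are illustrative, not part of the answer above.

```php
<?php
$text = 'Total: 1,234,567';

// Capture-group pattern: "234" is consumed by the first match,
// so the following ",567" has no leading digits left to match.
$pattern = '/([0-9]{1,3}),([0-9]{3})/';
$partial = preg_replace($pattern, '$1X$2', $text);
echo $partial, "\n"; // Total: 1X234,567

// Possible variant: match only the comma, requiring a digit before it
// and exactly three digits (not followed by a fourth) after it.
$lookaround = '/(?<=[0-9]),(?=[0-9]{3}(?![0-9]))/';
$complete = preg_replace($lookaround, 'X', $text);
echo $complete, "\n"; // Total: 1X234X567
```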

Upvotes: 0

Luuk

Reputation: 14948

In PHP you can do that using str_replace

$a="Hello 123,456";
echo str_replace(",", "X", $a);

This will return: Hello 123X456

Upvotes: -1
