Reputation: 414655

How to convert Roman numerals to int while rejecting invalid numbers using standard C?

What it means to be proper Roman numerals may vary. For simplicity (no Unicode, no multiplicative principle, no double subtractives, no overbars, no large numbers, etc) for the sake of this question, valid Roman numerals are defined by the regex:

^(M{0,3})(D?C{0,3}|CM|CD)(L?X{0,3}|XC|XL)(V?I{0,3}|IX|IV)$

Code example with POSIX regexec(). The regex matches Roman numerals in 1..3999 range represented using "strict" rules.

There are many solutions that can convert Roman numerals if we don't need to reject invalid numerals, for example:

int roman_numeral_value(unsigned char c)
{
  switch(toupper(c)) {
  case 'I': return 1;
  case 'V': return 5;
  case 'X': return 10;
  case 'L': return 50;
  case 'C': return 100;
  case 'D': return 500;
  case 'M': return 1000;
  default: return 0; // error
  }
}

int roman_numeral_to_int(const char *s, int size)
{
  int total = 0, prev = 0;
  for (int i = size-1; i >= 0; --i) { // in reverse order
    int value = roman_numeral_value(s[i]);
    total += value < prev ? -value : value; // subtract if necessary
    prev = value;
  }
  return total;
}

It works for valid Roman numerals. But roman_numeral_to_int() accepts numerals such as IIIII that are rejected by the regex. Is there a similarly simple cross-platform solution that doesn't require pcre_exec() or other external dependencies that works for valid Roman numerals and only for them?

Upvotes: 3

Answers (5)

chux

Reputation: 154075

To create some level of rule flexibility, the following Roman_string_to_unsigned0() employs a table.

It follows the strtol() style of functionality in that an end-pointer is returned indicating where parsing stopped. De-ref and test against '\0' for success.

The function has a bool subtractive parameter to steer the two major types of Roman Numeral parsing: basic, subtractive.

static const struct Roman_digit {
  char ch[3];
  bool subtractive;
  unsigned char limit;
  unsigned char nextdown;  // with parse success, offset to next element to try
  unsigned value;
} Roman_table[] = {
    { "I", false, 4, 1, 1 }, //
    { "IV", true, 1, 2, 4 }, //
    { "V", false, 1, 2, 5 }, //
    { "IX", true, 1, 4, 9 }, //
    { "X", false, 4, 1, 10 }, //
    { "XL", true, 1, 2, 40 }, //
    { "L", false, 1, 2, 50 }, //
    { "XC", true, 1, 4, 90 }, //
    { "C", false, 4, 1, 100 }, //
    { "CD", true, 1, 2, 400 }, //
    { "D", false, 1, 2, 500 }, //
    { "CM", true, 1, 4, 900 }, //
    { "M", false, 4, 1, 1000 }, //
};
#define Roman_table_N (sizeof Roman_table / sizeof Roman_table[0])

const char *Roman_string_to_unsigned0(unsigned *dest, const char *src, bool subtractive){
  *dest = 0;
  for (unsigned i = Roman_table_N; i > 0;) {
    const struct Roman_digit *digit = &Roman_table[i - 1];
    if (!subtractive && digit->subtractive) {
      i--;
      continue;
    }
    unsigned limit = digit->limit;  // repeat count
    if (limit > 1 && subtractive) limit--;
    size_t ch_length = strlen(digit->ch);
    size_t next_i = i-1;
    for (unsigned j=0; j<limit; j++) {
      if (strncmp(src, digit->ch, ch_length) == 0) {
        *dest += digit->value;
        if (*dest < digit->value) { // Overflow detection
          return (char*) src;
        }
        src += ch_length;
        next_i = i - digit->nextdown;  // With success, maybe skip down the list 
      } else {
        break;
      }
    }
    i = next_i;
  }
  return (char*) src;
}

Notes: Case insensitivity not yet encoded. An empty string returns 0. By this code working most-to-least significant, "XXXMMM" does not pass.

Upvotes: 2

jfs

Reputation: 414655

By generating C code from a higher-level specification, we can get a readable solution that uses only standard C. For example, the regex:

  ^(?P<thousands>       M{,3})
   (?P<hundreds>CM|CD|D?C{,3})
   (?P<tens>    XC|XL|L?X{,3})
   (?P<units>   IX|IV|V?I{,3})$

can be represented as a FSM using Ragel finite-state machine compiler:

thousands =                       ('M' %{ n += 1000; }){,3};
hundreds = "CM" %{ n += 900; }
         | "CD" %{ n += 400; }
         | ('D' %{ n += 500; } )? ('C' %{ n += 100;  }){,3};
tens     = "XC" %{ n += 90;  }
         | "XL" %{ n += 40;  }
         | ('L' %{ n += 50;  } )? ('X' %{ n += 10;   }){,3};
units    = "IX" %{ n += 9;   }
         | "IV" %{ n += 4;   }
         | ('V' %{ n += 5;   } )? ('I' %{ n += 1;    }){,3};
numeral = thousands hundreds tens units;

main := numeral > { n = 0; } ;

it is a complete specification: it converts all valid Roman numerals. It rejects all that are invalid
it is concise: you can put it on a card
it is straightforward: initialize n with zero and add thousands, hundreds, tens, and units. 100s, 10s, 1s follow a simple pattern: nine | four | (five? ten{0,3}) e.g., tens part is either 90 or 40 or optional 50 plus upto three 10s.
it is declarative and easy to extend without introducing errors e.g., to support four consecutive numerals such as IIII in addition to subtractive IV, it is enough to replace {,3} with {,4}. To support Unicode/lower/upper case letters, the corresponding literals such as 'M' could be replaced with M where M = 'M' | 'm' | "Ⅿ" | "ⅿ";
ragel translates it to a fast table- or goto-driven FSM in pure C.

Complete code example (with Unicode and IIII extensions mentioned above). The generated roman_numerals.c has no third-party dependencies.

Upvotes: 2

chux

Reputation: 154075

Use strcmp() or in other words, round-trip the string.

First consider the reverse problem, number --> string.

There are many ways to efficiently convert an integer to a string of Roman numerals. Let us call it:

// return false on error due to `value` range error or scant `size`
bool roman_int_to_string(char *dest, size_t size, int value);

Aside from letter case concerns, there is a one-to-one relationship between a canonical Roman number string and an int. Simply convert the source string to an int and then the int into another test string. If these strings match, we have a winner.

#define ROMAN_STRING_N 20
int roman_numeral_to_int_with_validate(const char *s, int size, bool *is_canonical) {
  int value = roman_numeral_to_int(s, size);

  char test[ROMAN_STRING_N];
  *is_canonical = roman_int_to_string(test, sizeof test, value);
  if (*is_canonical) {
    if (strcmp(s, test)) {  // Or use a case insensitive compare as desired
      *is_canonical = false;
    }
  }

  return value;
}

Lesson: I did code up a direct validation function. To test it I needed the inverse roman_int_to_string(). A random string generator rapidly showed many surprising errors like "CMC" and "CMCD" as well as OP's "IIII". In the end, using a simplistic string-to-int and int-to-string function pair and then doing a string compare was the most resilient.

Upvotes: 1

M Oehm

Reputation: 29126

The Roman digits come in two classes, the "ones" (I, X, C, M) and the "fives" (V,L, D). The "fives" can't be repeated and cannot be substracted. The "ones" can be repeated up to three times when they don't come after a smaller number and the can be subtracted from a number that isn't greater than the next "one".

When parsing, a digit can be in three different modes: It is either added normally or it is a number to be subtracted or it is a number from which the preceding number was substracted.

You can enforce these rules as you build your number. In addition to the value of a digit, you need a function that classifies the digit. In the code below, the function repeat does this. It returns the maximum number of repetitions per number, but it also serves as classification: 3 means a "one" and 1 means a "five."

The code below seems to produce the same results as your code with regex validation. It returns a positive number for valid Roman numbers and −1 otherwise. (And it has fewer than 28 conditionals.)

int digit(int c)
{
    if (c == 'I') return 1;
    if (c == 'V') return 5;
    if (c == 'X') return 10;
    if (c == 'L') return 50;
    if (c == 'C') return 100;
    if (c == 'D') return 500;
    if (c == 'M') return 1000;
    return 0;
}

int repeat(int c)
{
    if (c == 'I') return 3;
    if (c == 'V') return 1;
    if (c == 'X') return 3;
    if (c == 'L') return 1;
    if (c == 'C') return 3;
    if (c == 'D') return 1;
    if (c == 'M') return 3;
    return 0;
}

int from_roman(const char *s)
{
    int res = 0;                // running result
    int prev = 10000;           // value of previous digit

    if (s == NULL || *s == '\0') return -1;

    while (*s) {
        int c = *s++;                           // Roman digit
        int count = 1;                          // count of consecutive numbers
        int value = digit(c);                   // digit value
        int max = repeat(c);                    // allowed repetitions

        if (value == 0) return -1;              // illegal Roman digit

        while (*s == c) {
            s++;
            count++;
        }

        if (*s && digit(*s) > value) {
            int next = digit(*s++);

            if (max != 3) return -1;            // can only subtract I, X, C
            if (count > 1) return -1;           // can only subtract once
            if (next > 10 * value) return -1;   // IM,ID, IC, IL etc. invalid
            if (value * 10 > prev) return -1;   // VIV, VIX etc. invalid

            res += next - value;
        } else {
            if (count > max) return -1;         // too many repetitions
            if (value >= prev) return -1;       // must decrease

            res += count * value;
        }

        prev = value;
    }

    return res;
}

Edit: The first two drafts of my code had errors, which are now fixed.

Since the validation of correctness is done via a regular expression, another approach is to implement the regex directly, while at the same time calculating the value of the Roman number. Also, given how complicated it is to get the logic to the Roman numbers right, this might be the better approach.

An implementation of this approach could be:

/*
 *      Returns the length of the digit stretch and advances the pointer
 */
static int stretch(const char **s, int m, int max)
{
    int n = 0;

    while (n < max && **s == m) {
        (*s)++;
        n++;
    }
    return n;
}

/*
 *      Parses (I II III IV V VI VII VIII IX) for ones,
 *      tens and hundreds and advances the pointer.
 */
static int parse(const char **s, int x, int v, int i)
{
    int res = 0;

    if (**s == i && *(*s + 1) == x) {
        res += 9;
        *s += 2;
    } else if (**s == i && *(*s + 1) == v) {
        res += 4;
        *s += 2;
    } else {
        res += stretch(s, v, 1) * 5;
        res += stretch(s, i, 3);
    }

    return res;
}

/*
 *      Parse a Roman numeral according the the regex; -1 means failure
 */
int from_roman_regex(const char *s)
{
    int res = 0;

    if (s == NULL || *s == '\0') return -1;

    res += stretch(&s, 'M', 3) * 1000;
    res += parse(&s, 'M', 'D', 'C') * 100;
    res += parse(&s, 'C', 'L', 'X') * 10;
    res += parse(&s, 'X', 'V', 'I') * 1;

    if (*s) return -1;
    return res;
}

The stretch function emulates regexes such as X{0,3}; the parse function emulates regexes such as (V?I{0,3}|IX|IV), but in addition to singalling matching success or failure, it evaluates it as Roman number.

The first approach tries to implement the rules of Roman numbers. This is a bit complicated, but has the advantage that one cound extend it easily to provide exact error messages if one wanted to. The second approach has the advantage that it matches the question's spec exactly: It does what the regex does.

I've tested all Roman numbers up to 3,999 and all combinations of up to 7 Roman digits. The two approaches above and the OP's approach – simple aritgmetic plus regex validation – yielded the same results for all cases.

Upvotes: 2

Milap Pancholi

Reputation: 134

Simple and Sweet logic using subtract the value

ng sorry code is in a python but you can follow the logic i have used while i have done this programme**

def checkio(data):
roman=""
while(data!=0):
    if data>=1000:
    data-=1000
        roman+='M'
        elif data>=900:
        data-=900
        roman+='CM'
        elif data>=500:
        data-=500
        roman+='D'
        elif data>=400:
        data-=400
        roman+='CD'
        elif data>=100:
        data-=100
        roman+='C'
        elif data>=90:
        data-=90
        roman+='XC'
        elif data>=50:
        data-=50
        roman+='L'
        elif data>=40:
        data-=40
        roman+='XL'
        elif data>=10:
        data-=10
        roman+='X'
        elif data>=9:
        data-=9
        roman+='IX'
        elif data>=5:
        data-=5
        roman+='V'
        elif data>=4:
        data-=4
        roman+='IV'
        elif data>=1:
        data-=1
        roman+='I'
    return roman

Upvotes: 0

How to convert Roman numerals to int while rejecting invalid numbers using standard C?

Answers (5)

Related Questions