Reputation: 376

Why won't my JavaCC lexer/parser accept this input?

I am creating a lexer/parser which should accept strings that belong to an infinite set of languages.
One such string is "a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>".

The set of languages is defined as follows:

Base language, L0

A string from L0 consists of several blocks separated by space characters. At least one block must be present.
A block is an odd-length sequence of lowercase letters (a-z).
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.

Example of string belonging to L0:

zyx abcba m xyzvv

There is one space character between zyx and abcba, there are three spaces between abcba and m, and only one between m and xyzvv. No other space characters are present in the string.

Language L1

A string from L1 consists of several blocks separated by space characters. At least one block must be present.
There are two kinds of blocks. A block of the first kind must be an even-length sequence of uppercase letters (A-Z). A block of the second kind must have the shape <2U>. . .</2U>, where . . . stands for any string from L0.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.

Example of string belonging to L1:

YZ <2U>abc zzz</2U> ABBA <2U>kkkkk</2U> KM

Note that five spaces separate YZ and <2U>abc zzz</2U>, and three spaces divide abc from zzz. Otherwise single spaces are used as separators. There is no space in front of YZ and no space follows KM.

Language L2

A string from L2 consists of several blocks separated by space characters. At least one block must be present.
There are two kinds of blocks. A block of the first kind must be an odd-length sequence of lowercase letters (a-z). A block of the second kind must have the shape <2L>. . .</2L>, where . . . stands for any string from L1.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.

Example of string belonging to L2:

abc <2L>AA ZZ <2U>a bcd</2U></2L> z <2L><2U>abcde</2U></2L>

Single spaces are used as separators inside the sentence given above, but any other odd number of spaces would also lead to a valid L2 sentence.

Languages L{2k + 1}, k > 0

A string from L{2k + 1} consists of several blocks separated by space characters. At least one block must be present.
There are two kinds of blocks. A block of the first kind must be an even-length sequence of uppercase letters (A-Z). A block of the second kind must have the shape <2U>. . .</2U>, where . . . stands for any string from L{2k}.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.

Languages L{2k + 2}, k > 0

A string from L{2k + 2} consists of several blocks separated by space characters. At least one block must be present.
There are two kinds of blocks. A block of the first kind must be an odd-length sequence of lowercase letters (a-z). A block of the second kind must have the shape <2L>. . .</2L>, where . . . stands for any string from L{2k + 1}.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.

The code for my lexer/parser is as follows:

PARSER_BEGIN(Assignment)

  /** A parser which determines if user's input belongs to any one of the set of acceptable languages. */
  public class Assignment {
    public static void main(String[] args) { 
      try {
        Assignment parser = new Assignment(System.in);
        parser.Start();
        System.out.println("YES"); // If the user's input belongs to any of the set of acceptable languages, then print YES.
      } catch (ParseException e) {
        System.out.println("NO"); // If the user's input does not belong to any of the set of acceptable languages, then print NO.      
      }
    }
  }

PARSER_END(Assignment)

//** A token which matches any lowercase letter from the English alphabet. */
TOKEN :
{
  < #L_CASE_LETTER: ["a"-"z"] >
}

//* A token which matches any uppercase letter from the English alphabet. */
TOKEN:
{
  < #U_CASE_LETTER: ["A"-"Z"] >
}

//** A token which matches an odd number of lowercase letters from the English alphabet. */
TOKEN:
{
  < ODD_L_CASE_LETTER: <L_CASE_LETTER>(<L_CASE_LETTER><L_CASE_LETTER>)* >
}

//** A token which matches an even number of uppercase letters from the English alphabet. */
TOKEN:
{
  < EVEN_U_CASE_LETTERS: (<U_CASE_LETTER><U_CASE_LETTER>)+ >
}

//* A token which matches the string "<2U>" . */
TOKEN:
{
  < OPEN_UPPER: "<2U>" >
}

//* A token which matches the string "</2U>". */
TOKEN:
{
  < CLOSE_UPPER: "</2U>" >
}

//* A token which matches the string "<2L>". */
TOKEN:
{
  < OPEN_LOWER: "<2L>" >
}

//* A token which matches the string "</2L>". */
TOKEN:
{
  < CLOSE_LOWER: "</2L>" >
}

//* A token which matches an odd number of white spaces. */
TOKEN : 
{
  < ODD_WHITE_SPACE: " "(" "" ")* >
}

//* A token which matches an EOL character. */
TOKEN:
{
 < EOL: "\n" | "\r" | "\r\n" >
}

/** This production matches strings which belong to the base language L^0. */
void Start() :
{}
{
  LOOKAHEAD(3)
  <ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> <ODD_L_CASE_LETTER>)* <EOL> <EOF>

  |

  NextLanguage()

  |

  LOOKAHEAD(3)
  NextLanguageTwo()

  |

  EvenLanguage()
}

/** This production matches strings which belong to language L^1. */
void NextLanguage():
{}
{
  (<OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())* <EOL> <EOF>

  |

  (<EVEN_U_CASE_LETTERS>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())* <EOL> <EOF>
}

/** This production matches either an even number of uppercase letters, or a string from L^0, encased within the tags <2U> and </2U>. */
void UpperOrPseudoStart() :
{}
{
  <EVEN_U_CASE_LETTERS>

  |

  <OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>
}

/** This production matches strings from L^0, in a similar way to Start(); however, the strings that it matches do not have EOL or EOF characters after them. */
void PseudoStart() :
{}
{
  <ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> <ODD_L_CASE_LETTER>)*
}

/** This production matches strings which belong to language L^2. */
void NextLanguageTwo() :
{}
{
  (<ODD_L_CASE_LETTER>)+ (<ODD_WHITE_SPACE> LowerOrPseudoNextLanguage())* <EOL> <EOF>

  |

  (<OPEN_LOWER> PseudoNextLanguage() <CLOSE_LOWER>)+ (<ODD_WHITE_SPACE> LowerOrPseudoNextLanguage())* <EOL> <EOF>
}

/** This production matches either an odd number of lowercase letters, or a string from L^1, encased within the tags <2L> and </2L>. */
void LowerOrPseudoNextLanguage() :
{}
{
  <ODD_L_CASE_LETTER> 

  |

  <OPEN_LOWER> PseudoNextLanguage() <CLOSE_LOWER>
}

/** This production matches strings from L^1, in a similar way to NextLanguage(); however, the strings that it matches do not have EOL or EOF characters after them. */
void PseudoNextLanguage() :
{}
{
  (<OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())*

  |

  (<EVEN_U_CASE_LETTERS>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())*
}

/** This production matches strings which belong to any of the languages L^{2k + 2}, where k > 0 (the infinite set of even languages). */ 
void EvenLanguage() :
{}
{
  (<ODD_L_CASE_LETTER>)+ (<ODD_WHITE_SPACE> EvenLanguageAuxiliary())* <EOL> <EOF>

  |

  (CommonPattern())+ (<ODD_WHITE_SPACE> EvenLanguageAuxiliary())* <EOL> <EOF>
}

/** This production is an auxiliary production that helps when parsing strings from any of the even set of languages. */
void EvenLanguageAuxiliary() :
{}
{
  CommonPattern()

  |

  <ODD_L_CASE_LETTER>
}


void CommonPattern() :
{}
{
  <OPEN_LOWER> <EVEN_U_CASE_LETTERS> <ODD_WHITE_SPACE> <OPEN_UPPER> <ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> CommonPattern())+ <CLOSE_UPPER> <CLOSE_LOWER>
}

Several times now, I have inputted the string "a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>".
However, each time, NO is printed out on the terminal.
I have looked through my code carefully several times, checking the order in which I think the input string should be parsed; but, I haven't been able to find any errors in my logic or reasons why the string isn't being accepted.

Could I have some suggestions as to why it isn't being accepted, please?

Upvotes: 0

Answers (2)

Caleb Owusu-Yianoma

Reputation: 376

The following steps helped to solve the problem.

Run the following code:
javacc -debug_parser Assignment.jj
javac Assignment*.java
Then, run the lexer/parser (by typing java Assignment) and then input the string:
"a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>"
The resulting trace of parser actions shows that the production NextLangaugeTwo() is called on this string, rather than the desired EvenLanguage() production.
Tracing through NextLangaugeTwo() shows that it matches the first eight tokens in the input string.
So, using a lookahead of 9, although inefficient, causes the input string to be accepted. That is, modify the Start() production by changing the second lookahead value (just above the call to NextLanguageTwo()) from 3 to 9.

Upvotes: 1

sjwarner

Reputation: 462

Are any of your inputs being accepted? I have copied your code over to my computer and have found that any correct input (as far as I can tell from the definition of your language), it always outputs 'NO'.

Upvotes: 0

Why won&#39;t my JavaCC lexer/parser accept this input?

Answers (2)

Related Questions

Why won't my JavaCC lexer/parser accept this input?