Thorin Oakenshield

Reputation: 14692

How to create a parser (lex/yacc)?

I have the following file, which needs to be parsed:

--TestFile
Start ASDF123
Name "John"
Address "#6,US" 
end ASDF123

Lines that start with -- are treated as comment lines. The file starts with 'Start' and ends with 'end'. The string after Start is the UserID; the name and address follow, each inside double quotes.

I need to parse the file and write the parsed data into an XML file.

So the resulting file will look like this:

<ASDF123>
  <Name Value="John" />
  <Address Value="#6,US" />
</ASDF123>

Right now I'm using pattern matching (regular expressions) to parse the above file. Here is my sample code.

    /// <summary>
    /// To Store the row data from the file
    /// </summary>
    List<String> MyList = new List<String>();

    String strName = "";
    String strAddress = "";
    String strInfo = "";

Method : ReadFile

    /// <summary>
    /// To read the file into a List
    /// </summary>
    private void ReadFile()
    {
        StreamReader Reader = new StreamReader(Application.StartupPath + "\\TestFile.txt");
        while (!Reader.EndOfStream)
        {
            MyList.Add(Reader.ReadLine());
        }
        Reader.Close();
    }

Method : FormateRowData

    /// <summary>
    /// To remove comments 
    /// </summary>
    private void FormateRowData()
    {
        MyList = MyList.Where(X => X != "").Where(X => X.StartsWith("--")==false ).ToList();
    }

Method : ParseData

    /// <summary>
    /// To Parse the data from the List
    /// </summary>
    private void ParseData()
    {
        Match l_mMatch;
        Regex RegData = new Regex("start[ \t\r\n]*(?<Data>[a-z0-9]*)", RegexOptions.IgnoreCase);
        Regex RegName = new Regex("name [ \t\r\n]*\"(?<Name>[a-z]*)\"", RegexOptions.IgnoreCase);
        Regex RegAddress = new Regex("address [ \t\r\n]*\"(?<Address>[a-z0-9 #,]*)\"", RegexOptions.IgnoreCase);
        for (int Index = 0; Index < MyList.Count; Index++)
        {
            l_mMatch = RegData.Match(MyList[Index]);
            if (l_mMatch.Success)
                strInfo = l_mMatch.Groups["Data"].Value;
            l_mMatch = RegName.Match(MyList[Index]);
            if (l_mMatch.Success)
                strName = l_mMatch.Groups["Name"].Value;
            l_mMatch = RegAddress.Match(MyList[Index]);
            if (l_mMatch.Success)
                strAddress = l_mMatch.Groups["Address"].Value;
        }
    }

Method : WriteFile

    /// <summary>
    /// To write parsed information into file.
    /// </summary>
    private void WriteFile()
    {
        XDocument XD = new XDocument(
                           new XElement(strInfo,
                                         new XElement("Name",
                                             new XAttribute("Value", strName)),
                                         new XElement("Address",
                                             new XAttribute("Value", strAddress))));
        XD.Save(Application.StartupPath + "\\File.xml");
    }

I've heard of ParserGenerator.

Please help me write a parser using lex and yacc. The reason is that my existing parser (pattern matching) is not flexible; moreover, I don't think it is the right way to do this.

How do I make use of ParserGenerator? (I've read Code Project Sample One and Code Project Sample Two, but I'm still not familiar with it.) Please suggest a parser generator that outputs C# parsers.

Upvotes: 3

Views: 7822

Answers (2)

Aasmund Eldhuset

Reputation: 37990

Gardens Point LEX and the Gardens Point Parser Generator are strongly influenced by LEX and YACC, and output C# code.

Your grammar is simple enough that I think your current approach is fine, but kudos for wanting to learn the "real" way of doing it. :-) So here's my suggestion for a grammar (just the production rules; this is far from a full example. The actual GPPG file needs to replace the ... by C# code for building the syntax tree, and you need token declarations etc. - read the GPPG examples in the documentation. And you also need the GPLEX file that describes the tokens):

/* Your input file is a list of "top level elements" */
TopLevel : 
    TopLevel TopLevelElement { ... }
    | /* (empty) */

/* A top level element is either a comment or a block. 
   The COMMENT token must be described in the GPLEX file as 
   any line that starts with -- . */
TopLevelElement:
    Block { ... }
    | COMMENT { ... }

/* A block starts with the token START (which, in the GPLEX file, 
   is defined as the string "Start"), continues with some identifier 
   (the block name), then has a list of elements, and finally the token
   END followed by an identifier. If you want to validate that the
   END identifier is the same as the START identifier, you can do that
   in the C# code that analyses the syntax tree built by GPPG.
   The token Identifier is also defined with a regular expression in GPLEX. */
Block:
    START Identifier BlockElementList END Identifier { ... }

BlockElementList:
    BlockElementList BlockElement { ... }
    | /* empty */

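/* yacc-style grammars have no ( ... ) grouping, so each keyword
   gets its own alternative. */
BlockElement:
    NAME QuotedString { ... }
    | ADDRESS QuotedString { ... }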
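
To make the production rules above more concrete, here is a minimal hand-written recursive-descent sketch in C# that follows them one rule per method or loop. It is not GPPG output and does not use the GPPG or GPLEX APIs; `BlockParser`, its helper methods, and the single token regex are hypothetical names and simplifications chosen for illustration. It assumes keywords are case-insensitive (as in the question's regexes) and that comments appear only at the top level, as the grammar states.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical hand-written parser for illustration; not generated by GPPG.
public sealed class BlockParser
{
    // Tokens: a COMMENT line, a QuotedString, or a bare word.
    // Simplification: characters that fit no alternative are skipped silently.
    private static readonly Regex TokenRegex = new Regex(
        "^--[^\r\n]*|\"[^\"\r\n]*\"|[^\\s\"]+", RegexOptions.Multiline);

    private readonly List<string> tokens = new List<string>();
    private int pos;

    private BlockParser(string input)
    {
        foreach (Match m in TokenRegex.Matches(input))
            tokens.Add(m.Value);
    }

    // TopLevel : TopLevel TopLevelElement | /* empty */
    public static List<XElement> Parse(string input)
    {
        var parser = new BlockParser(input);
        var blocks = new List<XElement>();
        while (parser.pos < parser.tokens.Count)
        {
            // TopLevelElement : Block | COMMENT
            if (parser.tokens[parser.pos].StartsWith("--", StringComparison.Ordinal))
                parser.pos++;
            else
                blocks.Add(parser.ParseBlock());
        }
        return blocks;
    }

    // Block : START Identifier BlockElementList END Identifier
    private XElement ParseBlock()
    {
        ExpectKeyword("Start");
        string id = Next("identifier");
        // XElement throws if the identifier is not a valid XML name.
        var block = new XElement(id);

        // BlockElementList : BlockElementList BlockElement | /* empty */
        while (IsKeyword("Name") || IsKeyword("Address"))
        {
            // BlockElement : NAME QuotedString | ADDRESS QuotedString
            string element = IsKeyword("Name") ? "Name" : "Address";
            pos++;
            string quoted = Next("quoted string");
            if (quoted[0] != '"')
                throw new FormatException("Expected a quoted string after " + element);
            block.Add(new XElement(element,
                new XAttribute("Value", quoted.Substring(1, quoted.Length - 2))));
        }

        ExpectKeyword("end");
        // Validate that the END identifier repeats the START identifier.
        if (Next("identifier") != id)
            throw new FormatException("Block '" + id + "' is closed with a different identifier.");
        return block;
    }

    private bool IsKeyword(string keyword)
    {
        return pos < tokens.Count &&
               string.Equals(tokens[pos], keyword, StringComparison.OrdinalIgnoreCase);
    }

    private void ExpectKeyword(string keyword)
    {
        if (!IsKeyword(keyword))
            throw new FormatException("Expected '" + keyword + "' at token " + pos);
        pos++;
    }

    private string Next(string what)
    {
        if (pos >= tokens.Count)
            throw new FormatException("Unexpected end of input; expected " + what);
        return tokens[pos++];
    }
}
```

Calling `BlockParser.Parse(File.ReadAllText(path))` returns one `XElement` per block, which can be wrapped in an `XDocument` and saved as in the question's `WriteFile` method. A generated parser does the same job, but takes the rules from the grammar file instead of hand-written methods.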

Upvotes: 5

M'vy

Reputation: 5774

You will first have to define the grammar for your parser (the yacc part).

It seems to be something like:

file : record file
     | /* empty, so the recursion can terminate */
     ;

record: start identifier recordContent end identifier { /* rule to match the two identifiers */ }
      ;

recordContent: name value; // Can be more detailed if you require order in the fields

The lexical analysis will be performed by lex, and I guess your regexes will be useful for defining the tokens.
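
As a rough illustration of what the lexical analysis step does, here is a small table-driven tokenizer sketch in plain C#. It is not lex or GPLEX output; `TokenKind`, `Token`, and `Lexer` are hypothetical names, and the patterns are adapted from the regexes in the question. One simplification to note: `--` starts a comment wherever a token can begin, not only at the start of a line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public enum TokenKind { Comment, Start, End, Name, Address, QuotedString, Identifier }

public sealed class Token
{
    public TokenKind Kind;
    public string Text;
}

// Hypothetical tokenizer for illustration; a lex-style tool generates the equivalent.
public static class Lexer
{
    // Order matters: keywords are tried before the general Identifier rule.
    private static readonly KeyValuePair<TokenKind, Regex>[] Rules =
    {
        Rule(TokenKind.Comment, @"--[^\r\n]*"),
        Rule(TokenKind.Start, @"start\b"),
        Rule(TokenKind.End, @"end\b"),
        Rule(TokenKind.Name, @"name\b"),
        Rule(TokenKind.Address, @"address\b"),
        Rule(TokenKind.QuotedString, @"""[^""\r\n]*"""),
        Rule(TokenKind.Identifier, @"[a-z0-9]+"),
    };

    private static KeyValuePair<TokenKind, Regex> Rule(TokenKind kind, string pattern)
    {
        // \G anchors each match at the current scan position.
        return new KeyValuePair<TokenKind, Regex>(
            kind, new Regex(@"\G(?:" + pattern + ")", RegexOptions.IgnoreCase));
    }

    public static List<Token> Tokenize(string input)
    {
        var tokens = new List<Token>();
        int pos = 0;
        while (pos < input.Length)
        {
            // Whitespace separates tokens but is not a token itself.
            if (char.IsWhiteSpace(input[pos])) { pos++; continue; }

            Token token = null;
            foreach (var rule in Rules)
            {
                Match m = rule.Value.Match(input, pos);
                if (m.Success)
                {
                    token = new Token { Kind = rule.Key, Text = m.Value };
                    break;
                }
            }
            if (token == null)
                throw new FormatException(
                    "Unexpected character '" + input[pos] + "' at offset " + pos);

            tokens.Add(token);
            pos += token.Text.Length;
        }
        return tokens;
    }
}
```

The parser then works on `TokenKind` values rather than raw text, which is the same division of labour as between lex and yacc.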

My answer is a rough draft. I advise you to look online for a more complete tutorial on lex/yacc or flex/bison, and come back here if you have a more focused problem.

I also do not know whether there is a C# implementation that would let you stay in managed code. You may have to import unmanaged C/C++ code.

Upvotes: 1
