Reputation: 13

How to use recursion to parse text into a tree structure?

I am new to C# and I am completely stuck on a parsing issue where I want to use recursion but my attempts to do so have gotten me nowhere.

I want to read in a file that has the following format:

root:
   fileA
   sub1:
      fileB
      fileC
      sub2:
         fileD
         fileE
      fileF
   fileG
   sub3:
      fileH

Essentially, the lines ending the colons (:) are supposed to represent directories and the lines that do not end in colons are supposed to represent files in their parent directory as such: fileA and fileG belong in the root directory, fileB, fileC, and fileF are located within the sub1 directory, and so on (location is determined by indents/spaces).

Thus, I want to read in this file, and more complex files with a similar structure in a better way than I am currently doing it (a horrendous mess of for loops and if statements). I am using simple custom classes for the directories and files (I am not using .NET classes except for StreamReader for reading in the text file line by line)

I have done something similar in python once but for some reason I cannot wrap my head around how to do it in C#, which is silly since this is an abstract problem and the language specific implementations should not matter that much. I guess what does matter is my lack of understanding of how to best apply recursion in situations like these. Am I on the right track? I just sense that there is a much more elegant way of parsing this file into my self defined classes (retaining a tree structure in the sample text) and I think recursion is the answer. I just cannot see it.

Any help would be appreciated, even if it is less an answer and more a violent shove in the right direction. Or a gentle nudge.

Sample Code in C# trying to use Recursion (incomplete, but I hope it gets across what I am trying to do):

public void buildDirectoryFromFile(string file)
{
    string line;
    StreamReader data = new StreamReader(file);           

    int depth = 0;
    while ((line = data.ReadLine()) != null)
    {
        depth = line.Length - line.TrimStart(' ').Length;
        parseTextIntoTree(line, depth);
    }            
}  


public void parseTextIntoTree(string line, int depth)
{    
   if (line.Contains(':'))
   {
      //Totally lost      
   }
   else
   {
       //Totally lost
   }         
}

Depth in this case refers to the spaces/indents. The more spaces there are between the string and the left margin the 'deeper' it is in the tree.

Upvotes: 1

Answers (4)

Mohit S

Reputation: 14064

Try this. Since you specified the depth would be refering to the spaces/indents. Thus

TreeStruc = @"root:
               fileA
               sub1:
                  fileB
                  fileC
                  sub2:
                     fileD
                     fileE
                  fileF
               fileG
               sub3:
                  fileH";
TreeStruc = Regex.Replace(TreeStruc, " ", "-");
TreeStruc = Regex.Replace(TreeStruc, "---", "-");
using(StringReader reader = new StringReader(TreeStruc))
{
    string line;
    while((line = reader.ReadLine()) != null)
    {
        int cnt = line.Count(f => f == '-');
        String str2Replace = new String('-', cnt);
        line = Regex.Replace(line, str2Replace == "" ? "-" : str2Replace, cnt.ToString() + "~ ");
        updatedTreeStruc = updatedTreeStruc + line + "\n";
    }
}

It would show the result as follows:

root:
1~ fileA
1~ sub1:
2~ fileB
2~ fileC
2~ sub2:
3~ fileD
3~ fileE
2~ fileF
1~ fileG
1~ sub3:
2~ fileH

Now from here on we already know the depth that is the number in front of the string.

In the same way we can analyze whether the string is a folder or a file.

Upvotes: 0

erisco

Reputation: 14329

I do not have much experience with parsing layout-style or whitespace sensitive languages, but from what I have experienced such a task does not fall under the usual parsing solutions. You at least have to step beyond context-free languages.

The way I decided to model this example of layout rules is as a binary tree. Nodes are the line content. Descending to the left represents maintaining the same indent level whereas descending to the right represents increasing the indent level.

 root
/   \
ε   fileA
   /     \
  sub1:   ε
  |    \___
  |        |
  fileG     fileB – ε
  |         |     
  sub3:     fileC – ε
  | \       |
  ε  fileH  sub2: —+
            |      |
            fileF  fileD
                   |
                   fileE

I have modeled your source file as such a tree. You will also note that in respect to line order the tree descends to the right first, then left second.

What we have now is a way to view the source in a brace-style way which unlike layout-style can be parsed with any language parsing tool. For example, say we want to generate tokens to be consumed by a parser. This can be easily done recursively as your intuition hinted.

Emit the token for the root node (if the node is ε then we emit no token).
Emit an open brace and then recurse on the right subtree.
Emit a close brace and then recurse on the left subtree.

In terms of tree traversals this is close to a right-to-left preorder. However, we are also adding a close brace in-order.

I now follow this algorithm to tokenise the example. For simplicity I just use the names in the original source as the token names and also add { and } as tokens.

(recursion_depth.step_number)
(1.1) root
(1.2) root {
(2.1) root { fileA
(2.2) root { fileA {
(2.3) root { fileA { }
(3.1) root { fileA { } sub1:
(3.2) root { fileA { } sub1: {
(4.1) root { fileA { } sub1: { fileB
etc.

Finally arriving at (formatted for clarity)

root {
  fileA { }
  sub1: {
    fileB { }
    fileC { }
    sub2: {
      fileD { }
      fileE { }
    }
    fileF { }
  }
  fileG { }
  sub3: {
    fileH { }
  }
}

From here I hope it is clearer how you can construct an abstract syntax tree for your language. If you want to construct the tree I mention then consider keeping a stack of indentation levels. It requires a bit of thinking but I'll leave you to that.

Upvotes: 1

drvolcano

Reputation: 110

This code works with your posted file:

     //The top-level Node
    TreeNode root;

    //Temp value for processing in parseTextIntoTree
    TreeNode ParentNode;

    //Temp value for processing in parseTextIntoTree
    int actualdepth = -1;

    //The depth step between two layers
    int depthdifference = 3;

    public void buildDirectoryFromFile(string file)
    {
        string line;
        StreamReader data = new StreamReader(file);

        int depth = 0;
        while ((line = data.ReadLine()) != null)
        {
            depth = line.Length - line.TrimStart(' ').Length;
            parseTextIntoTree(line, depth);
        }

        this.treeView1.Nodes.Add(root);
    }

    public void parseTextIntoTree(string line, int depth)
    {
        //At the beginning define the root node
        if (root == null)
        {
            root = new TreeNode(line);
            ParentNode = root;
            actualdepth = depth;
        }
        else
        {
            //Search the parent node for the current depth
            while (depth <= actualdepth)
            {
                //One step down
                ParentNode = ParentNode.Parent;
                actualdepth -= depthdifference;
            }

            ParentNode = ParentNode.Nodes.Add(line.Trim());
            actualdepth = depth;
        }
    }

Upvotes: 0

Gilang

Reputation: 311

Let's try this:

public void buildDirectoryFromFile(string file)
{
    string line;
    StreamReader data = new StreamReader(file);           

    List<string> lines = new List<string>();
    while ((line = data.ReadLine()) != null)
    {
        lines.Add(line);

    }    
    int lineProcessed = 0;
    ParseTextIntoTree(lines, ref lineProcessed, 0);        
}  

public const int PadCount = 3; // Your padding length in text file

public void ParseTextIntoTree(List<string> lineList, ref int lineProcessed, int depth)
{
    while(lineProcessed < lineList.Count)
    {
        string line = lineList[lineProcessed++];
        int lineDepth = line.Length - line.TrimStart(' ').Length;

        if(lineDepth < depth)
        {
            // If the current depth is lower than the desired depth
            // end of directory structure is reached. Do backtrack
            // and reprocess the line in upper depth
            lineProcessed--;
            return;
        }

        if(line.EndsWith(":"))
        {
            // Do something, perhaps create directory?
            ParseTextIntoTree(lineList, ref lineProcessed, depth + PadCount);
        }
        else
        {
            // Do something, perhaps create file?
        }
    }
}

Upvotes: 0

How to use recursion to parse text into a tree structure?

Answers (4)

Related Questions