Writing good Scala without co-routines (including a Python example using yield)

Question

I'm currently learning Scala and looking for an elegant solution to a problem that is easily solved by the use of co-routines.

Since co-routines are not enabled by default in Scala I assume them to be at least not widely accepted best practice and thus want to write my code without using them.

A convincing argument that co-routines/continuations are the best practice to employ would be an alternative acceptable answer.

generate_all_matching_paths function

I want to write a function which searches files inside a base directory. The match and descend criteria should be provided by an instance of a class that has the "PathMatcher" trait. (At least I think this is the way to go in Scala)

A PathMatcher can be used to determine if a fs_item_path is matching AND it determines if the search should descend into a directory (in case the fs_item_path is the path of a directory).

My now following approach that I took for the Python implementation is only provided to indicate what functionality I have in mind.

I want to write this code the "Scala way".

I'm aiming for a solution with these characteristics:

be readable
use Scala idioms where they fit well
return the matched paths during the search (and not after)
works with very large file collections (but not extremely deeply nested) - i.e. thousands of files in a directory; many millions of files in total

I assume that the solution will involve lazy evaluating streams, but I was not able to assemble the stream in a working way.

I have also read that if used incorrectly, lazy streams can keep copies of 'old values' around. The solution I'm after would not do this.

arguments

base_abs_path

absolute path of the directory to start the search at

rel_ancestor_dir_list

list of directory names that indicate how far we have descended into base_abs_path subdirectories

rel_path_matcher

An instance of a class with PathMatcher trait.

In the example below I used a regular expression implementation, but I don't want to limit the use to regular expressions.

Example in Python

Here a complete working Python program (tested with Python 3.4) that includes a Python version of "generate_all_matching_paths".

The program will search "d:\Projects" for file system paths that end with "json", analyze the files for what indentation they use, and then print out the results.

If a path includes the substring "python_portable", then the search will not descend into that directory.

import os
import re
import codecs

#
# this is the bespoke function I want to port to Scala
#
def generate_all_matching_paths(
    base_dir_abs_path,
    rel_ancestor_dir_list,
    rel_path_matcher
):
    rooted_ancestor_dir_list = [base_dir_abs_path] + rel_ancestor_dir_list
    current_dir_abs_path = os.path.join(*rooted_ancestor_dir_list)

    dir_listing = os.listdir(current_dir_abs_path)
    for fs_item_name in dir_listing:
        fs_item_abs_path = os.path.join(
            current_dir_abs_path,
            fs_item_name
        )
        fs_item_rel_ancestor_list = rel_ancestor_dir_list + [fs_item_name]
        fs_item_rel_path = os.path.join(
            *fs_item_rel_ancestor_list
        )

        result = rel_path_matcher.match(fs_item_rel_path)
        if result.is_match:
            yield fs_item_abs_path
        if result.do_descend and os.path.isdir(fs_item_abs_path):
            child_ancestor_dir_list = rel_ancestor_dir_list + [fs_item_name]
            for r in generate_all_matching_paths(
                base_dir_abs_path,
                child_ancestor_dir_list,
                rel_path_matcher
            ):
                yield r



#
# all following code is only a context giving example of how generate_all_matching_paths might be used
#

class MyMatchResult:

    def __init__(
        self,
        is_match,
        do_descend
    ):
        self.is_match = is_match
        self.do_descend = do_descend

# in Scala this should implement the PathMatcher trait
class MyMatcher:

    def __init__(
        self,
        rel_path_regex,
        abort_dir_descend_regex_list
    ):
        self.rel_path_regex = rel_path_regex
        self.abort_dir_descend_regex_list = abort_dir_descend_regex_list

    def match(self, path):

        rel_path_match = self.rel_path_regex.match(path)
        is_match = rel_path_match is not None

        do_descend = True
        for abort_dir_descend_regex in self.abort_dir_descend_regex_list:
            abort_match = abort_dir_descend_regex.match(path)
            if abort_match:
                do_descend = False
                break

        r = MyMatchResult(is_match, do_descend)
        return r

def leading_whitespace(file_path):
    b_leading_spaces = False
    b_leading_tabs = False

    with codecs.open(file_path, "r", "utf-8") as f:
        for line in f:
            for c in line:
                if c == '	':
                    b_leading_tabs = True
                elif c == ' ':
                    b_leading_spaces = True
                else:
                    break
            if b_leading_tabs and b_leading_spaces:
                break
    return b_leading_spaces, b_leading_tabs


def print_paths(path_list):
    for path in path_list:
        print(path)


def main():
    leading_spaces_file_path_list = []
    leading_tabs_file_path_list = []
    leading_mixed_file_path_list = []
    leading_none_file_path_list = []

    base_dir_abs_path = r'd:\Projects'

    rel_path_regex = re.compile('.*json$')
    abort_dir_descend_regex_list = [
        re.compile('^.*python_portable.*$')
    ]
    rel_patch_matcher = MyMatcher(rel_path_regex, abort_dir_descend_regex_list)

    ancestor_dir_list = []
    for fs_item_path in generate_all_matching_paths(
        base_dir_abs_path,
        ancestor_dir_list,
        rel_patch_matcher
    ):
        if os.path.isfile(fs_item_path):

            b_leading_spaces, b_leading_tabs = leading_whitespace(fs_item_path)

            if b_leading_spaces and b_leading_tabs:
                leading_mixed_file_path_list.append(fs_item_path)
            elif b_leading_spaces:
                leading_spaces_file_path_list.append(fs_item_path)
            elif b_leading_tabs:
                leading_tabs_file_path_list.append(fs_item_path)
            else:
                leading_none_file_path_list.append(fs_item_path)

    print('space indentation:')
    print_paths(leading_spaces_file_path_list)

    print('tab indentation:')
    print_paths(leading_tabs_file_path_list)

    print('mixed indentation:')
    print_paths(leading_mixed_file_path_list)

    print('no indentation:')
    print_paths(leading_none_file_path_list)

    print('space: {}'.format(len(leading_spaces_file_path_list)))
    print('tab: {}'.format(len(leading_tabs_file_path_list)))
    print('mixed: {}'.format(len(leading_mixed_file_path_list)))
    print('none: {}'.format(len(leading_none_file_path_list)))

if __name__ == '__main__':
    main()

Karl Bielefeldt · Accepted Answer

You're right that you would usually replace python yield with some sort of lazy evaluation. Here's one proof of concept that uses a case class to represent a directory to avoid doing the file IO stuff for this example.

case class Directory(val name: String, val files: List[String], val subDirectories: List[Directory])

def descendFilter(directory: Directory): Boolean = directory.name != "tmp"
def matchFilter(path: String): Boolean = path contains "important"

def traverse(directory: Directory, path: String = ""): Stream[String] = {
  val newPath = path + directory.name + "/"
  val files = (directory.files map (newPath + _)).toStream

  val filteredSubdirs = directory.subDirectories filter descendFilter
  val recursedSubdirs = filteredSubdirs map {x => traverse(x, newPath)}
  val combinedSubdirs = recursedSubdirs.fold(Stream.Empty)(_ ++ _)

  (path + directory.name) #:: files ++ combinedSubdirs
}

val directory = Directory("", List(), List(
  Directory("var", List("pid"), List()),
  Directory("opt", List("java"), List()),
  Directory("tmp", List("lots", "of", "temp", "files"), List()),
  Directory("home", List(), List(
    Directory("karl", List("important stuff"), List())
  ))
))

traverse(directory) filter matchFilter foreach println

You can basically work with the stream as if it contains the entire filesystem, but internally, it will only fetch them as needed, and discard them just as quickly, unless you hold onto references to them somewhere else.