Reputation: 1807
I'm currently learning Scala and looking for an elegant solution to a problem that is easily solved with co-routines.
Since co-routines are not enabled by default in Scala, I assume they are not widely accepted best practice, and I therefore want to write my code without them.
A convincing argument that co-routines/continuations are the best practice to employ would also be an acceptable answer.
I want to write a function that searches for files inside a base directory. The match and descend criteria should be provided by an instance of a class with the "PathMatcher" trait. (At least I think this is the way to go in Scala.)
A PathMatcher can be used to determine whether a fs_item_path matches, AND it determines whether the search should descend into a directory (in case the fs_item_path is the path of a directory).
The Python implementation that follows is only provided to indicate what functionality I have in mind; I want to write this code the "Scala way".
I'm aiming for a solution with these characteristics:

- I assume the solution will involve lazily evaluated streams, but I was not able to assemble the stream in a working way.
- I have also read that, if used incorrectly, lazy streams can keep copies of 'old values' around. The solution I'm after would not do this.

The function takes these parameters:

- the absolute path of the directory to start the search at
- a list of directory names that indicates how far we have descended into base_abs_path subdirectories
- an instance of a class with the PathMatcher trait; in the example below I used a regular expression implementation, but I don't want to limit the use to regular expressions
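A rough sketch of what I mean by the PathMatcher trait (the names here are just placeholders, not a fixed API; the regex variant mirrors the Python MyMatcher below, using findFirstIn rather than Python's anchored match):

```scala
import scala.util.matching.Regex

// Placeholder sketch of the PathMatcher trait described above.
case class MatchResult(isMatch: Boolean, doDescend: Boolean)

trait PathMatcher {
  // Decide whether relPath matches, and whether to descend into it.
  def matchPath(relPath: String): MatchResult
}

// One possible implementation, analogous to the regex-based MyMatcher below.
class RegexMatcher(matchRe: Regex, abortDescendRes: List[Regex]) extends PathMatcher {
  def matchPath(relPath: String): MatchResult =
    MatchResult(
      isMatch = matchRe.findFirstIn(relPath).isDefined,
      doDescend = !abortDescendRes.exists(_.findFirstIn(relPath).isDefined)
    )
}
```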
Here is a complete working Python program (tested with Python 3.4) that includes a Python version of "generate_all_matching_paths".
The program will search "d:\Projects" for file system paths that end with "json", analyze the files for the indentation they use, and then print out the results.
If a path includes the substring "python_portable", the search will not descend into that directory.
import os
import re
import codecs


#
# this is the bespoke function I want to port to Scala
#
def generate_all_matching_paths(
    base_dir_abs_path,
    rel_ancestor_dir_list,
    rel_path_matcher
):
    rooted_ancestor_dir_list = [base_dir_abs_path] + rel_ancestor_dir_list
    current_dir_abs_path = os.path.join(*rooted_ancestor_dir_list)
    dir_listing = os.listdir(current_dir_abs_path)
    for fs_item_name in dir_listing:
        fs_item_abs_path = os.path.join(
            current_dir_abs_path,
            fs_item_name
        )
        fs_item_rel_ancestor_list = rel_ancestor_dir_list + [fs_item_name]
        fs_item_rel_path = os.path.join(
            *fs_item_rel_ancestor_list
        )
        result = rel_path_matcher.match(fs_item_rel_path)
        if result.is_match:
            yield fs_item_abs_path
        if result.do_descend and os.path.isdir(fs_item_abs_path):
            child_ancestor_dir_list = rel_ancestor_dir_list + [fs_item_name]
            for r in generate_all_matching_paths(
                base_dir_abs_path,
                child_ancestor_dir_list,
                rel_path_matcher
            ):
                yield r
#
# all following code is only a context-giving example of how generate_all_matching_paths might be used
#
class MyMatchResult:
    def __init__(
        self,
        is_match,
        do_descend
    ):
        self.is_match = is_match
        self.do_descend = do_descend


# in Scala this should implement the PathMatcher trait
class MyMatcher:
    def __init__(
        self,
        rel_path_regex,
        abort_dir_descend_regex_list
    ):
        self.rel_path_regex = rel_path_regex
        self.abort_dir_descend_regex_list = abort_dir_descend_regex_list

    def match(self, path):
        rel_path_match = self.rel_path_regex.match(path)
        is_match = rel_path_match is not None
        do_descend = True
        for abort_dir_descend_regex in self.abort_dir_descend_regex_list:
            abort_match = abort_dir_descend_regex.match(path)
            if abort_match:
                do_descend = False
                break
        r = MyMatchResult(is_match, do_descend)
        return r
def leading_whitespace(file_path):
    b_leading_spaces = False
    b_leading_tabs = False
    with codecs.open(file_path, "r", "utf-8") as f:
        for line in f:
            for c in line:
                if c == '\t':
                    b_leading_tabs = True
                elif c == ' ':
                    b_leading_spaces = True
                else:
                    break
            if b_leading_tabs and b_leading_spaces:
                break
    return b_leading_spaces, b_leading_tabs


def print_paths(path_list):
    for path in path_list:
        print(path)
def main():
    leading_spaces_file_path_list = []
    leading_tabs_file_path_list = []
    leading_mixed_file_path_list = []
    leading_none_file_path_list = []

    base_dir_abs_path = r'd:\Projects'
    rel_path_regex = re.compile('.*json$')
    abort_dir_descend_regex_list = [
        re.compile('^.*python_portable.*$')
    ]
    rel_path_matcher = MyMatcher(rel_path_regex, abort_dir_descend_regex_list)
    ancestor_dir_list = []
    for fs_item_path in generate_all_matching_paths(
        base_dir_abs_path,
        ancestor_dir_list,
        rel_path_matcher
    ):
        if os.path.isfile(fs_item_path):
            b_leading_spaces, b_leading_tabs = leading_whitespace(fs_item_path)
            if b_leading_spaces and b_leading_tabs:
                leading_mixed_file_path_list.append(fs_item_path)
            elif b_leading_spaces:
                leading_spaces_file_path_list.append(fs_item_path)
            elif b_leading_tabs:
                leading_tabs_file_path_list.append(fs_item_path)
            else:
                leading_none_file_path_list.append(fs_item_path)

    print('space indentation:')
    print_paths(leading_spaces_file_path_list)
    print('tab indentation:')
    print_paths(leading_tabs_file_path_list)
    print('mixed indentation:')
    print_paths(leading_mixed_file_path_list)
    print('no indentation:')
    print_paths(leading_none_file_path_list)
    print('space: {}'.format(len(leading_spaces_file_path_list)))
    print('tab: {}'.format(len(leading_tabs_file_path_list)))
    print('mixed: {}'.format(len(leading_mixed_file_path_list)))
    print('none: {}'.format(len(leading_none_file_path_list)))


if __name__ == '__main__':
    main()
Upvotes: 2
Views: 164
Reputation: 8487
Here is another way to do it in Scala (using Streams again):

def recursiveListFiles(f: File, matcher: File => Boolean): Stream[File] = {
  // listFiles returns null when f is not a readable directory
  val filesList = f.listFiles()
  val files = (
    if (filesList == null) Array[File]()
    else filesList
  ).toStream
  val (allDirs, allFiles) = files.partition(_.isDirectory)
  allFiles.filter(matcher(_)) ++
    allDirs.flatMap { d =>
      recursiveListFiles(d, matcher)
    }
}

def main(args: Array[String]): Unit = {
  recursiveListFiles(
    new File("/usr/share"),
    (f: File) => f.getName.endsWith(".png")) foreach println
}
Upvotes: 1
Reputation: 49028
You're right that you would usually replace Python's yield with some sort of lazy evaluation. Here's one proof of concept that uses a case class to represent a directory, to avoid the file IO for this example.
case class Directory(name: String, files: List[String], subDirectories: List[Directory])

def descendFilter(directory: Directory): Boolean = directory.name != "tmp"

def matchFilter(path: String): Boolean = path contains "important"

def traverse(directory: Directory, path: String = ""): Stream[String] = {
  val newPath = path + directory.name + "/"
  val files = (directory.files map (newPath + _)).toStream
  val filteredSubdirs = directory.subDirectories filter descendFilter
  val recursedSubdirs = filteredSubdirs map { x => traverse(x, newPath) }
  val combinedSubdirs = recursedSubdirs.fold(Stream.Empty)(_ ++ _)
  (path + directory.name) #:: files ++ combinedSubdirs
}

val directory = Directory("", List(), List(
  Directory("var", List("pid"), List()),
  Directory("opt", List("java"), List()),
  Directory("tmp", List("lots", "of", "temp", "files"), List()),
  Directory("home", List(), List(
    Directory("karl", List("important stuff"), List())
  ))
))

traverse(directory) filter matchFilter foreach println
You can basically work with the stream as if it contains the entire filesystem, but internally, it will only fetch them as needed, and discard them just as quickly, unless you hold onto references to them somewhere else.
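The on-demand behaviour is easy to observe directly. A minimal sketch, independent of the file-system example (this assumes the pre-2.13 Stream; in Scala 2.13+ the closer equivalent is LazyList):

```scala
// Sketch: Stream computes elements only when they are forced, and
// memoizes them afterwards; dropping the reference to the head lets
// already-forced elements be garbage collected.
var computed = 0
def expensive(i: Int): Int = { computed += 1; i * i }

val s: Stream[Int] = Stream.from(1).map(expensive)
// Stream's head is strict, so exactly one element has been computed so far.
assert(computed == 1)

val firstThree = s.take(3).toList
assert(firstThree == List(1, 4, 9))
// Only the forced prefix was ever computed.
assert(computed == 3)
```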
Upvotes: 3