Questions about Python 'yield' keyword that I have not found answers elsewhere, and its specific use in a code I am working on

Question

I have inherited a working Python script. I understand its purpose, its role in the big picture of how it interacts with other modules, and most of its internal architecture. However, I have to do a major overhaul of that script, essentially removing some old classes and adding plenty of new subclasses so that it provides more of the functionality we need. My question comes largely from an unexplained discrepancy I have noticed: some methods return a list with an object in it, while others yield the object instead.

# These functions are methods that belong to a class.
# A top-level script instantiates that class and calls these methods on
# the instance. Depending on the `self.mode` variable in the instance
# namespace, generateSubTests() invokes one of the subsequent methods,
# either generateHybridSim() or generatePureHWSim().
# Note that HybridSimStep and ShangHWSimStep are both classes themselves
# and are instantiated later on, as I describe after this chunk of code.

def generateSubTests(self) :
  if self.mode == TestStep.HybridSim :
    return self.generateHybridSim()
  elif self.mode == TestStep.PureHWSim or self.mode == TestStep.AlteraSyn \
     or self.mode == TestStep.AlteraNls or self.mode == TestStep.XilinxSyn  :
    return self.generatePureHWSim()

  return []

def generateHybridSim(self) :
  return [ HybridSimStep(self) ]

def generatePureHWSim(self) :
  yield ShangHWSimStep(self)
  num_iter = self.max_scheduling_iterations
  if num_iter > 1 :
    for i in range(1) :
      sim_step = ShangHWSimStep(self)
      sim_step.option = self.option.copy()
      sim_step.hls_base_dir = os.path.join(sim_step.hls_base_dir, str(i))
      sim_step.rtl_output = os.path.join(sim_step.hls_base_dir, sim_step.test_name + ".sv")
      sim_step.option['max_scheduling_iterations'] = i + 1
      yield sim_step

Ultimately, regardless of whether generateHybridSim() or generatePureHWSim() is invoked, the result is used in another module in exactly the same way:

# The 'job' in front of generateSubTests() is the variable holding the
# instance. prepareTest() and runTest() are called on each subtest that
# the iteration produces, and each subtest is itself a class instance.
# In short, these two methods are not defined within generateSubTests(),
# but on the classes whose instances generateHybridSim() and
# generatePureHWSim() returned or yielded, respectively.

for subtest in job.generateSubTests() :
      subtest.prepareTest()
      subtest.runTest()
      time.sleep(1)
      next_active_jobs.append(subtest)

I'm really confused now. I don't understand the significance of using yield here rather than return, and I need to figure out why the previous programmer wrote it that way, because I'll be implementing new subclasses that must contain their own generateSubTests() methods and adhere to the same calling convention. The fact that he wrote for subtest in job.generateSubTests() means that I am restricted to either returning a list with the instance in it or yielding the instance itself; otherwise it wouldn't fit Python's for-loop iteration protocol. I have tried modifying the yield statements in generatePureHWSim() to return a list, as generateHybridSim() does, and it seems to run fine, although I can't be sure whether that has introduced any subtle bugs. I don't know if I'm missing something here. Did the previous programmer want to facilitate concurrency http://www.dabeaz.com/coroutines/index.html by turning the function into a generator using yield?

He has since left our lab entirely and so I'm not able to consult him for his help.

Also, I've read up on yield from various sources including the post: What does the "yield" keyword do in Python? ; although they have helped me understand what yield does, I still don't understand how using it here helps us in our context. In fact, I don't even understand why the previous programmer wanted to implement a loop with for subtest in job.generateSubTests() : and force the generatePureHWSim() and generateHybridSim() methods to have to be generators themselves, so that we can have a loop just to call the other methods of prepareTest() and runTest() on the instance. Why couldn't he have just returned the class directly and called those methods???

This is really tripping me up. I would greatly greatly appreciate any help here!!! Thank you.

PS: one more question - I noticed that in general, if you have a function that you defined as:

def a():
  return b
  print "Now we return c"
  return c

It seems that as soon as the first statement is executed and b is returned, the function completes execution, and c is never returned because the statements that come after return b are never reached. Try adding the print statement, and you'll see that it is never printed.

However, when it comes to yield:

def x():
  yield y
  print "Now we yield z"
  yield z

I noticed that even after the first yield y statement has been executed, the subsequent yield z will get executed. Try adding the print statement, and you'll see that it gets printed out. This is something I observed as I was debugging the above code, and I don't understand this difference in behavior between yield and return. Can someone please enlighten me on it?

Thank you.

zehnpaard · Accepted Answer

I'm glad to tell you there's no concurrency involved.

The previous programmer wanted to have generateSubTests return a collection of subtests (maybe 0, 1 or more subtests). Each of those subtests will then be processed accordingly in the for subtest in job.generateSubTests(): loop.

Actually, if you look closely, generateHybridSim returns a normal Python list containing one subtest, not a generator object. But lists and generator objects are actually very similar things in this context - a sequence of subtests.
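
As a minimal sketch of that point (Python 3 syntax; FakeStep, as_list, as_generator and run_all are made-up names for illustration, not part of your script), the same for-loop accepts either kind of return value:

```python
class FakeStep:
    """Hypothetical stand-in for HybridSimStep / ShangHWSimStep."""
    def __init__(self, name):
        self.name = name

def as_list():
    # Returns a list with one item, like generateHybridSim().
    return [FakeStep("hybrid")]

def as_generator():
    # Yields items one at a time, like generatePureHWSim().
    yield FakeStep("hw-0")
    yield FakeStep("hw-1")

def run_all(subtests):
    # The caller only needs something it can iterate over.
    names = []
    for subtest in subtests:
        names.append(subtest.name)
    return names

list_names = run_all(as_list())
gen_names = run_all(as_generator())
empty_names = run_all([])  # the "return []" fallback simply loops zero times
```

So a new subclass can satisfy the existing loop with either style, as long as what it hands back is iterable.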

You have to realize that generatePureHWSim(self) is almost equivalent to the following code:

def generatePureHWSim(self) :
  output_list = []
  output_list.append(ShangHWSimStep(self))
  num_iter = self.max_scheduling_iterations
  if num_iter > 1 :
    for i in range(1) :
      sim_step = ShangHWSimStep(self)
      sim_step.option = self.option.copy()
      sim_step.hls_base_dir = os.path.join(sim_step.hls_base_dir, str(i))
      sim_step.rtl_output = os.path.join(sim_step.hls_base_dir, sim_step.test_name + ".sv")
      sim_step.option['max_scheduling_iterations'] = i + 1
      output_list.append(sim_step)
  return output_list

but with one exception. While the code above does all the calculation upfront and puts all the results into a list in memory, your version with yield will immediately yield a single subtest, and only do the following calculations when asked for the next result.
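
To make the difference in ordering visible, here is a small sketch (Python 3 syntax; make_step is a hypothetical stand-in for constructing a ShangHWSimStep, and the events list exists only to record what happens when):

```python
events = []

def make_step(n):
    # Hypothetical stand-in for building one subtest object.
    events.append("build %d" % n)
    return n

def eager():
    # List version: everything is built before the caller sees anything.
    return [make_step(0), make_step(1)]

def lazy():
    # Generator version: each item is built only when the loop asks for it.
    yield make_step(0)
    yield make_step(1)

for step in eager():
    events.append("run %d" % step)
eager_events = list(events)
# eager_events is ['build 0', 'build 1', 'run 0', 'run 1']

del events[:]
for step in lazy():
    events.append("run %d" % step)
lazy_events = list(events)
# lazy_events is ['build 0', 'run 0', 'build 1', 'run 1']
```

Applied to your loop, this suggests that with yield the second ShangHWSimStep is only constructed after prepareTest() and runTest() have been called on the first one, whereas a list-returning version constructs both before the loop body runs at all. Whether that ordering matters depends on what the constructor and those methods do, which I can't see from the question.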

There are multiple potential benefits to this, including:

Saving on memory (data is loaded only one-at-a-time rather than being loaded into a list all at once)
Saving on calculation (if you might break out of the loop early based on what gets returned)
Sequencing side-effects in a different order (personally not recommended, makes reasoning about code pretty hard).
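
The second benefit can be sketched like this (Python 3 syntax; subtests and built are illustrative names only). Breaking out of the loop early means the remaining items are never constructed:

```python
built = []

def subtests():
    for n in range(5):
        built.append(n)  # record that item n was actually constructed
        yield n

first_two = []
for item in subtests():
    first_two.append(item)
    if len(first_two) == 2:
        break  # items 2, 3 and 4 are never built

# built is [0, 1] here, not [0, 1, 2, 3, 4]
```

Your current loop never breaks early, so this particular benefit probably does not apply to it as written.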

Regarding your second question, as you observed, execution in a Python function ends when you hit the return statement. Putting more code after the return statement in the same code-block is pointless.

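A function containing yield does something slightly more complex: calling it returns a generator object, which behaves more like a list. Each yield hands one value to the caller and pauses the function; when the next value is requested, execution resumes right after that yield, which is why the code following the first yield still runs.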

The code below:

def generator_example():
    yield 1
    print "x"
    yield 2

can't really be compared with:

def return_example():
    return 1
    print "x"
    return 2

but is much closer to:

def list_example():
    output_list = []
    output_list.append(1)
    print "x"
    output_list.append(2)
    return output_list

generator_example and list_example both return an iterable that a for-loop can consume.
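
One caveat worth knowing if you swap one for the other: a generator can only be walked through once, while a list can be iterated as often as you like. A short sketch in Python 3 syntax (so print is a function here), stepping through the generator by hand with next():

```python
def generator_example():
    yield 1
    print("x")
    yield 2

def list_example():
    return [1, 2]

gen = generator_example()
first = next(gen)    # runs the body up to the first yield
second = next(gen)   # resumes after the first yield, prints "x", reaches the second yield
try:
    next(gen)        # nothing left to yield
    exhausted = False
except StopIteration:
    exhausted = True

gen2 = generator_example()
first_pass = list(gen2)    # [1, 2]
second_pass = list(gen2)   # [] because the generator is already used up

lst = list_example()
list_twice = (list(lst), list(lst))  # a list can be iterated repeatedly
```

A for-loop does the next() calls and the StopIteration handling for you. If any caller of generateSubTests() iterates the result twice, or calls len() on it or indexes into it, a generator will not behave like a list there.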

Unrelated comment on the code

The bit below is pretty weird though.

  if num_iter > 1 :
    for i in range(1) :
      sim_step = ShangHWSimStep(self)

No reason to use for i in range(1), that just loops once, with i set to 0. I'd strip the for i in range(1) bit out, dedent the code and either replace all occurrences of i with 0, or better, rename i to be more informative and set it explicitly to 0.
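
A rough sketch of what that cleanup might look like (Python 3 syntax). The ShangHWSimStep and Job classes below are hypothetical stubs I made up so the snippet runs on its own; their attributes are guesses based on what the method touches, not your real classes. The name iteration is also just a suggestion:

```python
import os

class ShangHWSimStep:
    """Hypothetical stub; the real class is not shown in the question."""
    def __init__(self, parent):
        self.option = parent.option
        self.hls_base_dir = parent.hls_base_dir
        self.test_name = parent.test_name
        self.rtl_output = None

class Job:
    """Hypothetical stub for the class that owns generatePureHWSim()."""
    def __init__(self, max_scheduling_iterations):
        self.max_scheduling_iterations = max_scheduling_iterations
        self.option = {'max_scheduling_iterations': max_scheduling_iterations}
        self.hls_base_dir = "hls"
        self.test_name = "demo"

    def generatePureHWSim(self):
        yield ShangHWSimStep(self)
        if self.max_scheduling_iterations > 1:
            iteration = 0  # replaces the loop variable of "for i in range(1)"
            sim_step = ShangHWSimStep(self)
            sim_step.option = self.option.copy()
            sim_step.hls_base_dir = os.path.join(sim_step.hls_base_dir, str(iteration))
            sim_step.rtl_output = os.path.join(sim_step.hls_base_dir, sim_step.test_name + ".sv")
            sim_step.option['max_scheduling_iterations'] = iteration + 1
            yield sim_step

steps_one = list(Job(1).generatePureHWSim())   # one subtest
steps_many = list(Job(3).generatePureHWSim())  # two subtests
```

This keeps the behavior of the original method. If the range(1) was a leftover from something like a loop over several iteration counts, that intent would need to be restored separately; I can't tell from the code shown.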
