Reputation: 1686
Note: Keep in mind for this whole problem that I want to use version 0.0.3 of the environment-cookbook.
Note 2: We NEVER had this problem before. This is recent and we don't know what caused it.
environment-cookbook and environment-cookbook-machines
To build domains, we have two cookbooks:
environment-cookbook-machines
environment-cookbook
When I check the Chef-server:
$ knife cookbook list -a | grep environment
environment-cookbook 0.0.3 0.0.4 0.0.5 0.0.6 0.0.7
Looking at company-environment-cookbook 0.0.3's metadata.json, I see it depends on dir-library 0.13.6:
"dependencies": {
"dir-library": "= 0.13.6"
}
Whereas environment-cookbook 0.0.4 and higher depend on dir-library 0.13.7:
"dependencies": {
"dir-library": "=0.13.7"
}
wrapper-domain and wrapper-domain-machines
As each domain can depend on a specific version of environment-cookbook, we use wrapper cookbooks for each domain, respectively:
wrapper-domain-machines
wrapper-domain
Checking metadata.json for both of the wrappers shows a dependency on environment-cookbook 0.0.3 (this is the version I want):
"dependencies": {
"environment-cookbook": "= 0.0.3"
...
}
wrapper-domain-machines also shows a dependency on environment-cookbook-machines 0.0.3.
The Berksfile.lock for those wrapper cookbooks looks like this:
GRAPH
  environment-cookbook (0.0.3)
    dir-library (= 0.13.6)
  environment-cookbook-machines (0.0.3) # Only here for wrapper-cookbook-machines
    dir-library (= 0.13.6)
When I run berks viz on my wrapper-domain-machines cookbook, I get the following dependency graph:
SO EVERYTHING SEEMS FINE.
When I run the domain build through a CI Job on Hudson, I see the following at the beginning of the log file:
INFO: Using dir-library (0.13.6)
INFO: Using environment-cookbook (0.0.3)
INFO: Using environment-cookbook-machines (0.0.3)
INFO: Installing environment-cookbook (0.0.3) from chef-server-url
INFO: Using dir-library (0.13.6)
A little bit further:
INFO: Run List is [recipe[wrapper-domain-machines::up-machines]]
INFO: Run List expands to [wrapper-domain-machines::up-machines]
INFO: Loading cookbooks [[email protected], [email protected], [email protected]]
So far so good. It's using version 0.0.3 for environment-cookbook and 0.13.6 for dir-library.
Later on during the build:
INFO: Run List is [recipe[environment-cookbook::prepare_machine]]
INFO: Run List expands to [environment-cookbook::prepare_machine]
INFO: Starting Chef Run for domain.company.com
INFO: Running start handlers
INFO: Start handlers complete.
resolving cookbooks for run list: ["environment-cookbook::prepare_machine"]
INFO: Loading cookbooks [environment-cookbook@0.0.7, dir-library@0.13.7]
STOP, WHAT?
INFO: Loading cookbooks [environment-cookbook@0.0.7, dir-library@0.13.7]
Delete cached cookbooks in .berkshelf/cookbooks
0.0.8
Check for dependencies on environment-cookbook in other cookbooks: NONE.
Delete and re-install all versions of environment-cookbook from 0.0.3 to 0.0.7: no luck.
Clean chef clients and nodes before re-running: no luck either.
Why is it changing to the latest version of environment-cookbook (0.0.7), thus picking its dependency dir-library (0.13.7)?
How can I troubleshoot this?
How can I avoid this in the future?
This is really a show-stopper for us.
Ask me for any further clarification and I'll update this post.
Upvotes: 1
Views: 1199
Reputation: 77941
I suspect this is the same problem with Chef that we ran into. It boils down to run-time revision control, as opposed to compile-time revision control.
Lesson learned: Unconstrained cookbook versions at run-time are dangerous when running Chef at scale.
You're using Berkshelf to manage your cookbook dependencies. That's great, and it will ensure the correct versions get loaded into the Chef server. The subtle problem is that each cookbook has its own dependency tree. At run-time, when you add multiple cookbooks to a node's run-list, the Chef server must calculate a fresh tree of dependencies. The problem can appear random because it depends on the combination of cookbooks you have on the run-list. The more cookbooks, the more potential for conflict.
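To make the run-time behaviour concrete, here is a deliberately simplified Ruby sketch. It is not the Chef server's real solver: `AVAILABLE`, `dir_library_constraint`, `pick_version`, and `resolve` are hypothetical names, and the list of uploaded dir-library versions is an assumption. The environment-cookbook versions and their dir-library pins are the ones quoted in the question. It only illustrates the idea that a run-list item without a version resolves to the newest uploaded cookbook, which then drags in its own dependency.

```ruby
require 'rubygems' # normally loaded already; provides Gem::Version

# Simplified illustration only: the real Chef server uses a full
# dependency solver. AVAILABLE, dir_library_constraint, pick_version and
# resolve are hypothetical names; the dir-library list is an assumption.
AVAILABLE = {
  'environment-cookbook' => %w[0.0.3 0.0.4 0.0.5 0.0.6 0.0.7],
  'dir-library' => %w[0.13.6 0.13.7]
}.freeze

# Dependency declared in metadata.json, as described in the question:
# 0.0.3 pins dir-library 0.13.6, while 0.0.4 and higher pin 0.13.7.
def dir_library_constraint(environment_cookbook_version)
  if Gem::Version.new(environment_cookbook_version) >= Gem::Version.new('0.0.4')
    '= 0.13.7'
  else
    '= 0.13.6'
  end
end

# Return the highest uploaded version that satisfies the constraint.
def pick_version(cookbook, constraint = '>= 0.0.0')
  requirement = Gem::Requirement.new(constraint)
  candidates = AVAILABLE.fetch(cookbook).map { |v| Gem::Version.new(v) }
  best = candidates.select { |v| requirement.satisfied_by?(v) }.max
  best && best.to_s
end

# Resolve environment-cookbook, then the dir-library version it drags in.
def resolve(constraint = '>= 0.0.0')
  env = pick_version('environment-cookbook', constraint)
  [env, pick_version('dir-library', dir_library_constraint(env))]
end

puts "unconstrained run-list item: #{resolve.inspect}"
puts "pinned to 0.0.3:             #{resolve('= 0.0.3').inspect}"
```

In this toy model the unconstrained case lands on environment-cookbook 0.0.7 with dir-library 0.13.7, while the pinned case stays on 0.0.3 with 0.13.6, which matches the two behaviours seen in the build log.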
We tried to fix this problem by explicitly setting dependencies in our cookbook metadata files. What we discovered was that Chef would silently fail to resolve the dependency tree for some of our cookbooks and default back to an older version for which it could calculate dependencies. Very puzzling.
We got closer to the problem when we began to explicitly set the version of the cookbook on the run-list. We began to get error messages that Chef was unable to resolve dependencies. What was especially weird was that this problem impacted our production Chef server the most. We eventually determined that it was because on production we had all historical versions of our cookbooks loaded. Purging old cookbooks helped, but did not solve our problems.
Chef had worked fine for nearly two years before we discovered these problems. It was time and scale that exposed a fatal flaw in our system. At run-time you need to fix the versions of your cookbooks to match the configurations you have previously tested.
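One way to fix a version at run-time is to pin it on the run-list item itself, using the `recipe[cookbook::recipe@version]` form that Chef accepts. The Ruby sketch below only builds such an item and prints a JSON document with a `run_list` key; `pinned_recipe` is a hypothetical helper name, and the cookbook, recipe, and version are the ones from the question.

```ruby
require 'json'

# Hypothetical helper: format a run-list item pinned to one exact version.
def pinned_recipe(cookbook, recipe, version)
  "recipe[#{cookbook}::#{recipe}@#{version}]"
end

item = pinned_recipe('environment-cookbook', 'prepare_machine', '0.0.3')

# A JSON attributes document whose run_list carries the pinned item.
node_json = JSON.pretty_generate('run_list' => [item])
puts node_json
```

A file with this content could, for example, be handed to chef-client through its JSON attributes option. Pinning each item by hand does not scale well, which is part of why the approaches described further down exist.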
Coming from a Java background, I equated the problem to how we can run multiple applications on a Tomcat server.
Maven is the build tool that manages each app's dependencies and creates a package for upload into Tomcat. In Chef it is Berkshelf that fulfils this function.
The big difference is at run-time. Tomcat creates a separate classpath for the jars belonging to each application. This provides strong isolation between applications at run-time, safely allowing them to run different versions of the same library. This is the impossible problem faced by Chef: at run-time, chef-client only runs a single set of cookbooks.
While I'm not a fan of policy files, I present them as the option favoured by Chef.
While most users are oblivious to the problem being solved, Chef has developed a new feature called policy files.
In a nutshell, what they're doing is setting a node's run-list in advance, at compile time.
One big benefit of policy files is that they result in a faster Chef run. The Chef server no longer has to figure out the large dependency tree, which can be a big saving in Chef installations with large numbers of cookbooks.
Personally, I'm not a fan of policy files, because I had already discovered the Environment cookbook pattern, a poorly understood but powerful feature of Chef that already existed.
Now, every time I deploy cookbooks, I use a Chef environment, the natural way to provide isolation in Chef (did I point out it was poorly understood?). Here's an example using Berkshelf:
berks upload
berks apply my_app_cookbook_version1
The very handy "apply" command will use the Berkshelf lock file to update the cookbook versions in the environment "my_app_cookbook_version1". Now you've fixed the run-time to match your tested conditions.
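Roughly speaking, the effect is that each locked cookbook version ends up in the environment's `cookbook_versions` as an exact constraint. The Ruby sketch below imitates that idea on the GRAPH section quoted in the question. It is not Berkshelf's implementation: `LOCK_GRAPH` and `locked_versions` are hypothetical names, and the two-space indentation of the graph entries is an assumption about the lock file layout.

```ruby
require 'json'

# GRAPH section as quoted in the question (indentation is assumed).
LOCK_GRAPH = <<~LOCK
  GRAPH
    environment-cookbook (0.0.3)
      dir-library (= 0.13.6)
    environment-cookbook-machines (0.0.3)
      dir-library (= 0.13.6)
LOCK

# Top-level graph entries look like "  name (1.2.3)". Nested dependency
# lines are indented further and carry an operator, so they do not match.
def locked_versions(graph_text)
  graph_text.each_line.with_object({}) do |line, locks|
    match = line.match(/\A  (\S+) \((\d+(?:\.\d+)+)\)\s*\z/)
    locks[match[1]] = "= #{match[2]}" if match
  end
end

# Minimal environment document: every locked cookbook becomes an exact pin.
environment = {
  'name' => 'my_app_cookbook_version1',
  'cookbook_versions' => locked_versions(LOCK_GRAPH)
}

puts JSON.pretty_generate(environment)
```

With exact pins like these in the environment, a node assigned to it should no longer drift to a newer upload of environment-cookbook just because one exists on the server.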
The consequence, of course, is that I have an environment per application cookbook.
This is actually a bonus for me, because it enables me to bootstrap infrastructure against something I've already tested:
knife bootstrap --environment my_app_cookbook_version1 ...
It creates predictability and means loading new cookbooks is not going to magically change my servers in production.
A bonus is that environments provide a record of the cookbook versions in use and a convenient place to set override attributes associated with deployment, like "app_owner", "app_version", etc.
Apologies for the long posting.
Upvotes: 3