Reputation: 12347
I have a script that runs for a few minutes as a job on the cluster in the production environment. There are between 0 and 100 such jobs, each with 1 script per job, running at the same time on the cluster. Usually, there are no jobs running, or a burst of about 4-8 such jobs.
I want to prevent such jobs from running when I deploy a new version of the code into production.
How do I do that to optimize maintainability?
My initial idea was this:
I am concerned about possible race conditions, and considering maybe using a database-based solution. In the case of my application, I would use PostgreSQL. This database-based solution may be more complex to implement and maintain, but may be less probe to race conditions.
Perhaps there is a standard mechanism to achieve this in Capistrano, which is used for deployment of this code?
Notes:
When you answer the question, please compare maintainability of your suggested solution with that of the simple solution I propose above (using lock files)
I am not sure if I need to take the race conditions into account. That is, is this system (with lock files) really race condition-prone? Or is it an unlikely possibility?
FAQs:
Is there a particular reason these jobs shouldn't run during deployment?
I had cases when multiple jobs would run during mid-deployment, and fail because of that. Finding and rerunning such failed jobs is time-consuming. Delaying them during deployment carries only a small and rare performance hit, and is by far the most acceptable solution. For our system, maintainability is priority number one.
Upvotes: 7
Views: 730
Reputation: 26093
I'll address the OS-based flag file approach:
High maintainability and portability: You just need a modern unix distro that comes with standard flock
, which means most of them. You can interact with it from the shell, but it's also exposed in most modern programming languages: Python fcntl.flock
, Ruby File.flock
, etc.
No race conditions: if you planned to handle files directly, consider using the flock
instead. Rather than creating/hunting/deleting files to raise/lower the flag, it uses an actual OS-level lock system on them. It offers built-in support for shared/exclusive locks, blocking/blocking with timeout/non-blocking and command-wrapping.
Cleanup: use at
in tandem with cron
. As @Jacob Miller pointed out, if you let cron
handle deletion directly, its cycles can run out of sync with your deployments, resulting in premature flag cleanup. If you instead tell it to only spot the flag files for deletion and schedule them to be wiped 30 minutes from being spotted, you guarantee a >=30min time window for the flag owner to complete their work.
If you're varying the advisory lock identifiers (file names in this scenario) a lot, you can have the periodic cleanup job that'll wipe the oldest, unlocked files, to prevent clutter. If you re-use the identifiers, any leftover files from past workers will be re-used and eventually deleted by future workers.
Apart from cleaning up the flag files, you might want to consider monitoring and cleaning up their associated workers that might be still hanging around.
Upvotes: 6
Reputation: 19655
Working with advisory locks at simplest level using psql
.
Session 1
select pg_advisory_lock(3752667);
Contents of advisory_lock_test.sql file:
select pg_advisory_lock(3752667);
select "VendorID" from nyc_taxi_pl limit 10;
Then session 2:
psql -d test -U postgres -p 5452 -f advisory_lock_text.sql
Null display is "NULL".
Then in session 1:
select pg_advisory_unlock(3752667);
Back to session 2:
Null display is "NULL".
pg_advisory_lock
------------------
(1 row)
VendorID
----------
1
2
2
2
2
2
1
1
2
2
(10 rows)
Note:
The below is using session level locks. Transaction locks are also available using
pg_advisory_xact_lock
Basically you create a lock in a session with pg_advisory_lock(3752667)
where the number can be one 64 bit integer of two 32 bit integers. These could come from values that you fetch from a table so a number is scoped to a particular action e.g. select pg_advisory_lock((select lock_number from a_lock where action = 'deploy'));
. Then in the second or other sessions you try to acquire a lock on the same number. If the number is in use, not unlocked or the original session did not exit, the other sessions will wait until the original session releases the lock. At that point the rest of the commands will run.
In your case create a number, possibly in a table, that is associated with deploying. When you run the deployments lock on the number before you run the changes, then unlock at end of deployment. If the deployment fails and the session ends the lock will also be released The other scripts would also need to start with attempting to lock on that number also. If it is in use they will wait until it is released and then run the rest of the script commands and unlock. How manageable this is depends on the number of scripts you are dealing with and getting people to stick to the process.
Upvotes: 3
Reputation: 834
As I described in comments this feature can be integrated as feature flag. Very popular solution for rails is gem flipper.
Pseudo code for your job can looks like (while I don't know your jobs code)
class ProcessingJob < ApplicationJob
queue_as :default
def perform
return unless Flipper.enabled?(:jobs_processing)
... job's code
end
end
Flipper has admin UI for enabling/disabling feature flags. So, for example, you can create feature jobs_processing
, enable it and then at some point before deploy you can turn off it.
While feature flag is disabled during deploy you will sure that no jobs are performed. And after deploy you can enable it again.
You could think Flipper is complicated solution for you feature, so you can do something simplier without gem and just create table in database, update your admin page with enabling/disabling features.
class CreateFeatureFlags < ActiveRecord::Migration[7.1]
def change
create_table :feature_flags do |t|
t.string :name, null: false, index: { unique: true }
t.timestamps
end
end
end
class FeatureFlag < ApplicationRecord
def self.enabled?(name)
where(name: name).exists?
end
end
FeatureFlag.enabled?('jobs_processing')
Upvotes: 5