Directory Monitoring
A common task in many batch processing systems is to look for the appearance of
new files and queue jobs to process them. DirmonJob is a job designed to do this task.
DirmonJob runs every 5 minutes by default, looking for new files that match
configured entries called DirmonEntry. These entries can be managed
programmatically, or via the Rocket Job Web Interface, the web management interface for Rocket Job.
Example, creating a DirmonEntry
entry = RocketJob::DirmonEntry.create!(
  pattern: '/path_to_monitor/*',
  job_class_name: 'MyFileProcessJob',
  archive_directory: '/exports/archive'
)
When a DirmonEntry is created it is initially disabled and needs to be enabled before
DirmonJob will start processing it:
entry.enable!
Active dirmon entries can also be disabled:
entry.disable!
The attributes of DirmonEntry:
pattern
- Wildcard path to search for files in. For details on valid path values, see: http://ruby-doc.org/core-2.2.2/Dir.html#method-c-glob
- Examples:
- input_files/process1/*.csv
- input_files/process2/**/*
job_class_name
- Name of the job to start
arguments
- Any user supplied arguments for the method invocation
All keys must be UTF-8 strings. The values can be any valid BSON type:
- Integer
- Float
- Time (UTC)
- String (UTF-8)
- Array
- Hash
- True
- False
- Mongoid::StringifiedSymbol
- nil
- Regular Expression
- Note: Date is not supported, convert it to a UTC time
properties
- Any job properties to set.
- Example, override the default job priority:
{ priority: 45 }
archive_directory
- Archive directory to move the file to before the job is started. It is important to
move the file before it is processed so that it is not picked up again for processing.
If no archive_directory is supplied the file will be moved to a folder called ‘_archive’
in the same folder as the file itself.
If the pattern above is a relative path, the relative path structure is maintained when the file is moved to the archive directory.
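To make the pattern and archive_directory attributes more concrete, here is a minimal sketch using only Ruby's standard library. It is not Rocket Job's implementation: the helper archive_file, the method scan_and_archive_demo and the temporary directory layout are hypothetical. It only mimics the documented behaviour of matching files with a glob pattern and moving each match into an ‘_archive’ folder beside it before processing.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper, not part of Rocket Job. Moves the file into
# archive_directory, or into an "_archive" folder next to the file when no
# archive_directory is given, and returns the new path.
def archive_file(file_name, archive_directory = nil)
  target_dir = archive_directory || File.join(File.dirname(file_name), "_archive")
  FileUtils.mkdir_p(target_dir)
  target = File.join(target_dir, File.basename(file_name))
  FileUtils.mv(file_name, target)
  target
end

# Builds a throwaway directory tree, scans it, archives the matches and
# scans again to show that an archived file is not picked up twice.
def scan_and_archive_demo
  Dir.mktmpdir do |root|
    input = File.join(root, "input_files", "process1")
    FileUtils.mkdir_p(input)
    File.write(File.join(input, "a.csv"), "id,name\n")
    File.write(File.join(input, "notes.txt"), "not a csv\n")

    pattern = File.join(input, "*.csv")

    # First scan: only the CSV file matches the wildcard pattern.
    matched = Dir.glob(pattern).map { |f| File.basename(f) }

    # Move every match away before it would be processed.
    archived = Dir.glob(pattern).map { |f| archive_file(f).sub("#{root}/", "") }

    # Second scan: nothing is left for the pattern to match.
    remaining = Dir.glob(pattern)

    { matched: matched, archived: archived, remaining: remaining }
  end
end

puts scan_and_archive_demo.inspect
```

The second scan returns nothing because the file was moved out of the path covered by the pattern, which is why archiving happens before the job starts.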
Starting the directory monitor
The directory monitor job only needs to be started once per installation by running the following code:
RocketJob::Jobs::DirmonJob.create!
DirmonJob is a scheduled job which is set to run every 5 minutes. Once created, its cron_schedule
can be changed at any time via the Rocket Job Web Interface.
For example, to override the cron schedule when creating DirmonJob:
RocketJob::Jobs::DirmonJob.create!(cron_schedule: "*/1 * * * * UTC")
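In a cron expression the first field is the minute, so "*/1" matches every minute and "*/5" matches minutes 0, 5, 10 and so on within each hour. The sketch below shows how the next run time follows from such a schedule. It is an illustration only: next_minute_step is a hypothetical helper that handles just the "*/N * * * * UTC" form, and it is not the parser Rocket Job uses.

```ruby
# Hypothetical helper, not part of Rocket Job. Returns the next UTC time,
# strictly after `now`, at which a "*/step * * * * UTC" schedule would fire.
def next_minute_step(now, step)
  raise ArgumentError, "step must be between 1 and 59" unless (1..59).cover?(step)

  now = now.getutc
  top_of_hour = Time.utc(now.year, now.month, now.day, now.hour)

  # Minutes matched by */step are 0, step, 2 * step, ... below 60.
  next_minute = ((now.min / step) + 1) * step

  # Past the last matching minute: the next match is minute 0 of the next hour.
  next_minute = 60 if next_minute > 59

  top_of_hour + (next_minute * 60)
end

now = Time.utc(2024, 1, 1, 12, 3, 10)
puts next_minute_step(now, 1)
puts next_minute_step(now, 5)
```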
The default priority for DirmonJob is 40. To increase its priority (a lower number means a higher priority):
RocketJob::Jobs::DirmonJob.create!(
  cron_schedule: "*/5 * * * * UTC",
  priority: 25
)
Once DirmonJob has been started, its priority and check interval can be
changed at any time as follows:
RocketJob::Jobs::DirmonJob.first.update_attributes(
  cron_schedule: "*/5 * * * * UTC",
  priority: 20
)
High Availability
DirmonJob automatically schedules a new instance of itself to run in
the future after it completes each scan. If the run is successful, the current job
instance destroys itself.
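The self-rescheduling pattern can be sketched with a small in-memory model. Everything here is an assumption made for illustration: ToyDirmonJob, run_due_jobs and the plain Array used as a queue are hypothetical stand-ins, not Rocket Job classes, and only the successful path is modelled.

```ruby
# Toy model only; none of these names come from Rocket Job.
INTERVAL_SECONDS = 5 * 60

class ToyDirmonJob
  attr_reader :run_at

  def initialize(run_at)
    @run_at = run_at
  end

  # Performs one scan (the block), then queues the next instance and removes
  # this one. If the scan raises, neither step happens; failure handling is
  # deliberately left out of this sketch.
  def perform(queue, now)
    yield
    queue << ToyDirmonJob.new(now + INTERVAL_SECONDS)
    queue.delete(self)
  end
end

# Runs every job that is due at `now`. Any worker could call this, which is
# why no single long-lived monitor process is needed.
def run_due_jobs(queue, now, &scan)
  due = queue.select { |job| job.run_at <= now }
  due.each { |job| job.perform(queue, now, &scan) }
  due.size
end

def reschedule_demo
  start = Time.utc(2024, 1, 1, 12, 0, 0)
  queue = [ToyDirmonJob.new(start)]
  scans = 0
  3.times do |i|
    run_due_jobs(queue, start + (i * INTERVAL_SECONDS)) { scans += 1 }
  end
  { scans: scans, queued: queue.size, next_run_at: queue.first.run_at }
end

puts reschedule_demo.inspect
```

After each run the queue still holds exactly one instance, scheduled one interval later, so the chain continues without any process staying resident between scans.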
In this way it avoids having a single Directory Monitor process that constantly
sits there monitoring folders for changes. More importantly it avoids a single
point of failure that is typical for earlier directory monitoring solutions.
Every time DirmonJob runs and scans the paths for new files it could be running
on a different worker. If a worker is removed or shut down, DirmonJob is not stopped,
since it will simply run on another worker instance.
There can only be one DirmonJob instance queued or running at a time.
If an exception occurs while running DirmonJob, a failed job instance will remain
in the job list for problem determination. The failed job cannot be restarted and
should be destroyed when no longer needed.
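As a rough illustration of these two rules, the following toy job list refuses a second instance while one is queued or running, and keeps a failed instance around until it is explicitly destroyed. ToyJobList, its methods and its states are hypothetical and exist only for this sketch; Rocket Job enforces these rules through its own mechanisms.

```ruby
# Toy model only; none of these names come from Rocket Job.
class ToyJobList
  ACTIVE_STATES = %i[queued running].freeze
  Job = Struct.new(:name, :state)

  attr_reader :jobs

  def initialize
    @jobs = []
  end

  # Adds a queued job unless one with the same name is already active.
  def create!(name)
    if @jobs.any? { |j| j.name == name && ACTIVE_STATES.include?(j.state) }
      raise ArgumentError, "#{name} is already queued or running"
    end

    job = Job.new(name, :queued)
    @jobs << job
    job
  end

  # Runs the block as the job's work. Success removes the job from the list;
  # an exception leaves it in the list with state :failed.
  def run(job)
    job.state = :running
    yield
    destroy(job)
    :completed
  rescue StandardError
    job.state = :failed
    :failed
  end

  # Removes exactly this job instance from the list.
  def destroy(job)
    @jobs.delete_if { |j| j.equal?(job) }
  end
end

def job_list_demo
  list = ToyJobList.new
  first = list.create!("DirmonJob")

  duplicate_rejected =
    begin
      list.create!("DirmonJob")
      false
    rescue ArgumentError
      true
    end

  outcome = list.run(first) { raise IOError, "simulated scan failure" }
  states_after_failure = list.jobs.map(&:state)

  list.destroy(first) # clean up once the failure has been investigated
  { duplicate_rejected: duplicate_rejected, outcome: outcome,
    states_after_failure: states_after_failure, remaining: list.jobs.size }
end

puts job_list_demo.inspect
```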