fts-rest-server 3.12.0, fts 3.12.0
It has been observed that the FTS scheduler may sometimes "forget" certain transfers. They can end up stuck for days, while newer transfers submitted for the same pair get picked up first. This behavior creates problems for our calling frameworks (e.g. ATLAS analysis frameworks), which cannot start until the "forgotten" transfers go through.
There are reasons why this happens, but due to the distributed nature of the scheduler algorithm and the fact that it runs independently on each FTS node, forcing these "forgotten" transfers to be picked up is a complex process. When an operational urgency arises and a transfer needs to be kickstarted, one has to manipulate the intricacies of the scheduler algorithm.
The FTS project needs a mechanism to allow FTS administrators to force start certain transfers, identified by job_id / file_id.
- A new FTS Server service
A new service should be implemented (e.g.: ForceStartTransfersService) which is responsible for picking up and starting transfers right away. This service will also be isolated from the main TransfersService:
- It will run on just one node (e.g.: the node having hash_range_start = 0)
- It will pick up only transfers in a specific file state (e.g.: FORCE_START).
- It will ignore any kind of link or storage endpoint configured limits
- Only FTS administrators will be able to move transfers into the FORCE_START file state
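The selection rules above could be sketched roughly as follows. This is only an illustration in Python, not the FTS server implementation: the `Transfer` class, `should_run_on_node` and `select_force_start` are hypothetical names, and the real service would read its candidates from the FTS database.

```python
from dataclasses import dataclass

# Hypothetical state name taken from the proposal text.
FORCE_START = "FORCE_START"


@dataclass
class Transfer:
    """Illustrative stand-in for a row of the file table."""
    file_id: int
    file_state: str


def should_run_on_node(hash_range_start: int) -> bool:
    """The service runs on one node only: the one with hash_range_start == 0."""
    return hash_range_start == 0


def select_force_start(transfers, hash_range_start):
    """Return the transfers this node should start right away.

    Only transfers in the FORCE_START state are selected. Link and
    storage endpoint limits are deliberately not consulted here.
    """
    if not should_run_on_node(hash_range_start):
        return []
    return [t for t in transfers if t.file_state == FORCE_START]
```

Because the filter depends only on the file state, ordinary transfers remain the responsibility of the regular TransfersService.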
- A new FTS-REST endpoint
To facilitate the job of the FTS administrators, a new endpoint in the REST interface is proposed as well. This would allow operations to submit a command such as:
$ davix-http -X POST --data="[<file_id_list>]" https://<fts-instance>:8446/admin/force-start
A new client command can also be developed around this task:
$ fts-rest-transfer-admin --forcestart <file_id_list>
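On the server side, the endpoint would need to validate the submitted list before touching any transfer. A minimal sketch, assuming the request body is a JSON array of integer file ids (the ticket does not fix the format, and `parse_file_ids` is a hypothetical helper):

```python
import json


def parse_file_ids(body: str) -> list:
    """Parse and validate the body of a force-start request.

    Assumption: the body is a non-empty JSON array of positive integers.
    Duplicates are dropped while keeping the submitted order.
    """
    data = json.loads(body)  # raises ValueError on malformed JSON
    if not isinstance(data, list) or not data:
        raise ValueError("expected a non-empty JSON array of file ids")
    ids = []
    for item in data:
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(item, bool) or not isinstance(item, int) or item <= 0:
            raise ValueError(f"invalid file id: {item!r}")
        ids.append(item)
    return list(dict.fromkeys(ids))
```

Rejecting the whole request on the first invalid entry keeps the operation all-or-nothing, which seems the safer default for an administrative command.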
Thus, managing stuck transfers will become a simple operational procedure. In the future, if desired, this transition to FORCE_START can also be integrated with the MONIT TransferState messages.
To ensure only the right people have access to this feature, we can extend the existing authorization mechanism by introducing a new admin role:
- It will be validated in the same way as the config role, via the static authorization table
- The admin role will include all capabilities of the config role
- The admin role will only be granted via DB access. The /config/authorize page will not allow granting the admin role to other DNs.
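The role rules above might look like this in outline. It is a sketch under stated assumptions: the capability names, the in-memory `roles_by_dn` mapping and both functions are hypothetical, and the real check would go through the existing static authorization table.

```python
# Hypothetical capability sets: admin includes everything config can do.
ROLE_CAPABILITIES = {
    "config": {"config"},
    "admin": {"config", "force_start"},
}

# Roles that /config/authorize may hand out; admin is deliberately absent
# because it is meant to be granted via direct DB access only.
GRANTABLE_VIA_REST = {"config"}


def is_allowed(roles_by_dn: dict, dn: str, capability: str) -> bool:
    """Check whether any role held by the DN includes the capability."""
    roles = roles_by_dn.get(dn, set())
    return any(capability in ROLE_CAPABILITIES.get(r, set()) for r in roles)


def grant_via_rest(roles_by_dn: dict, dn: str, role: str) -> None:
    """Grant a role through the REST interface, refusing the admin role."""
    if role not in GRANTABLE_VIA_REST:
        raise PermissionError(f"role {role!r} cannot be granted via REST")
    roles_by_dn.setdefault(dn, set()).add(role)
```

With this layout, a DN holding only the config role fails the `force_start` check, while an admin DN passes both checks.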
FTS-1844 Introduce new "admin" authorization level