When a staging multihop job reaches the max_time_in_queue limit, the job is supposed to be canceled. What actually happens is that the NOT_USED hop is marked as CANCELED, but the first hop, if it is STARTED, is left untouched. The final job state still ends up as CANCELED.
Later on, the sanitizer thread will spot a file state inconsistency (STARTED file state, CANCELED job state) and will force-fail the staging transfer. Because of this force-failure, no abort is requested for the staged file.
In CTA, having many such failed staging operations with no abort can add pressure to the disk buffer. This behavior was observed on the FTS3-CMS instance.
- Cancel STARTED files as well
- Cancel the job only if all files within the job reached terminal state
- Do not cancel NOT_USED files for multihop jobs
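
The intended behavior can be illustrated with a minimal sketch. This is not the FTS3 implementation: the function name `expire_job`, the `FileState` enum, and the set of terminal states are assumptions made for illustration. The sketch also marks a STARTED file as CANCELED immediately, whereas a real cancellation would presumably go through an abort request first.

```python
from enum import Enum


class FileState(Enum):
    # Illustrative subset of file states; assumed, not taken from the FTS3 schema.
    SUBMITTED = "SUBMITTED"
    STARTED = "STARTED"
    NOT_USED = "NOT_USED"
    FINISHED = "FINISHED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


# Assumption: these are the terminal file states.
TERMINAL = {FileState.FINISHED, FileState.FAILED, FileState.CANCELED}


def expire_job(file_states, multihop):
    """Hypothetical helper: apply the max_time_in_queue expiry to one job.

    Returns (new_file_states, job_canceled).
    """
    new_states = []
    for state in file_states:
        if state in TERMINAL:
            # Already finished, failed, or canceled: leave as-is.
            new_states.append(state)
        elif state is FileState.NOT_USED and multihop:
            # Do not cancel NOT_USED hops of a multihop job.
            new_states.append(state)
        else:
            # Cancel everything else, including STARTED files.
            new_states.append(FileState.CANCELED)

    # Cancel the job only if every file reached a terminal state.
    job_canceled = all(s in TERMINAL for s in new_states)
    return new_states, job_canceled
```

For a multihop job whose first hop is STARTED and second hop is NOT_USED, the sketch cancels the first hop and leaves both the second hop and the job state alone, so the STARTED/CANCELED inconsistency described above does not arise at this step.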