FTS-1724

Handle multihop staging jobs reaching max-time-in-queue limit

Details

    • Type: Bug
    • Status: Closed
    • Priority: High
    • Resolution: Fixed
    • Affects Version/s: fts 3.10.2
    • Fix Version/s: fts 3.11.0
    • Component/s: Server
    • Security Level: Public Data (This ticket is visible to anyone on the internet and will be indexed by search engines)

    Description

      When a multihop staging job reaches the max_time_in_queue limit, the job is supposed to be canceled. In practice, only the NOT_USED hop is marked as CANCELED; the first hop, if it is in the STARTED state, is left untouched. The final job state nevertheless ends up being CANCELED.

      Later on, the sanitizer thread will spot a file state inconsistency (STARTED file state, CANCELED job state) and will force-fail the staging transfer. Because of this force-failure, no abort is requested for the staged file.

      In CTA, having many such failed staging operations with no abort can add pressure to the disk buffer. This behavior was observed on the FTS3-CMS instance.
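
The sequence described above can be reproduced with a minimal Python sketch. This is an illustration only, not FTS code: the `Job` class, the hop names, the initial `STAGING` job state, and the functions `expire_job_current` and `sanitizer_pass` are all hypothetical names chosen for this example. Only the state names NOT_USED, STARTED and CANCELED come from the ticket.

```python
from dataclasses import dataclass, field

# Assumed set of terminal states for this sketch.
TERMINAL = {"FINISHED", "FAILED", "CANCELED"}


@dataclass
class Job:
    state: str
    files: dict  # hop name -> file state
    aborts_requested: list = field(default_factory=list)


def expire_job_current(job):
    """Behavior reported in the ticket: only NOT_USED hops are canceled."""
    for hop, state in job.files.items():
        if state == "NOT_USED":
            job.files[hop] = "CANCELED"
    # The job itself is still marked CANCELED.
    job.state = "CANCELED"


def sanitizer_pass(job):
    """Force-fail files left non-terminal in a terminal job.

    No abort is requested here, which is the problem the ticket describes.
    """
    force_failed = []
    for hop, state in job.files.items():
        if job.state in TERMINAL and state not in TERMINAL:
            job.files[hop] = "FAILED"
            force_failed.append(hop)
    return force_failed


# Two-hop staging job: first hop is being staged, second hop is waiting.
job = Job(state="STAGING", files={"hop1": "STARTED", "hop2": "NOT_USED"})
expire_job_current(job)
state_after_expiry = dict(job.files)
force_failed = sanitizer_pass(job)
```

After `expire_job_current`, the first hop is still STARTED while the job is CANCELED. The sanitizer pass then force-fails that hop, and `job.aborts_requested` stays empty throughout.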

      Possible solutions

      1. Cancel STARTED files as well
      2. Cancel the job only if all files within the job have reached a terminal state
      3. Do not cancel NOT_USED files for multihop jobs
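
As a rough illustration of options 1 and 2, the following Python sketch shows what each could look like on a simplified in-memory model. The ticket does not say which option was implemented, and none of this is FTS code: `expire_job_option1`, `can_cancel_job_option2`, the `request_abort` callback, and the set of terminal states are assumptions made for the example.

```python
# Assumed set of terminal states for this sketch.
TERMINAL = {"FINISHED", "FAILED", "CANCELED"}


def expire_job_option1(files, request_abort):
    """Option 1: cancel STARTED files as well, requesting an abort for each.

    files: dict mapping hop name -> file state.
    request_abort: hypothetical callback invoked once per STARTED hop.
    Returns the new job state and the new file states.
    """
    new_files = {}
    for hop, state in files.items():
        if state == "STARTED":
            request_abort(hop)  # staging is in progress, so abort it
            new_files[hop] = "CANCELED"
        elif state == "NOT_USED":
            new_files[hop] = "CANCELED"
        else:
            new_files[hop] = state  # leave other states untouched
    return "CANCELED", new_files


def can_cancel_job_option2(files):
    """Option 2: the job may be canceled only once every file is terminal."""
    return all(state in TERMINAL for state in files.values())


aborted = []
job_state, file_states = expire_job_option1(
    {"hop1": "STARTED", "hop2": "NOT_USED"}, aborted.append
)
blocked = can_cancel_job_option2({"hop1": "STARTED", "hop2": "CANCELED"})
```

With option 1, both hops end up CANCELED and an abort is recorded for the STARTED hop, so the job and file states stay consistent. With option 2, the check returns false while the first hop is still STARTED, so the job would not be marked CANCELED yet.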

            People

              Assignee: Mihai Patrascoiu (mipatras)
              Reporter: Mihai Patrascoiu (mipatras)