Type: New Feature
Affects Version/s: fts 3.10.1
Security Level: Public Data (This ticket is visible to anyone on the internet and will be indexed by search engines)
A problematic pattern in the Rucio / FTS interaction was discovered in a CMS workflow when writing to TAPE:
- If FTS hits a problem where the data is successfully copied to tape but the transfer is still marked as FAILED, Rucio will keep retrying the transfer
- Rucio doesn't use overwrite when writing to TAPE (for good reason)
- FTS won't be able to copy the file again because it exists already (expected behavior)
This leaves us in a bad spot where the file will be retried continuously.
The exact problem which led to this behavior: the FTS process was online but no longer processing status messages from the fts_url_copy process. Eventually, another node recognized these transfers as stalled and put them in the FAILED file state.
To get out of this loop, the proposal is to add "file-reuse" functionality when the Archive Monitoring feature is requested:
- A transfer is attempted and the destination file already exists
- Verify the checksum of the existing destination file
- If checksum is valid, consider the transfer part complete
- Move to Archive Monitoring
This feature will be available only when Archive Monitoring is requested.
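The steps above can be sketched as a small decision function. This is only an illustration of the proposed flow, not FTS code: the names `TransferRequest`, `NextStep` and `decide_next_step` are hypothetical, and the case-insensitive checksum comparison is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NextStep(Enum):
    COPY = "copy"                              # destination absent: run the normal copy
    ARCHIVE_MONITORING = "archive_monitoring"  # reuse the existing file, skip the copy
    FAIL = "fail"                              # destination exists and cannot be reused


@dataclass
class TransferRequest:
    # Hypothetical fields, for illustration only
    expected_checksum: str    # checksum supplied with the transfer request
    archive_monitoring: bool  # True when Archive Monitoring was requested


def normalize(checksum: str) -> str:
    # Assumption: checksums are compared ignoring case and surrounding whitespace
    return checksum.strip().lower()


def decide_next_step(request: TransferRequest,
                     destination_checksum: Optional[str]) -> NextStep:
    """Pick the next step for a transfer.

    destination_checksum is None when the destination file does not exist.
    """
    if destination_checksum is None:
        return NextStep.COPY
    if not request.archive_monitoring:
        # File reuse is only offered together with Archive Monitoring
        return NextStep.FAIL
    if normalize(destination_checksum) == normalize(request.expected_checksum):
        # Valid copy already on tape: treat the transfer part as complete
        return NextStep.ARCHIVE_MONITORING
    return NextStep.FAIL
```

In this sketch a checksum mismatch still fails the transfer, so an invalid file on tape is never silently accepted.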
For disk endpoints, it's preferable to use overwrite and recopy the file.
However, for tape endpoints, overwriting has larger implications.
The aim of this feature is to help avoid deleting valid files from tape.