[DMC-1062] gfal2 unduly returns "no such file or directory" Created: 12/Jul/18  Updated: 24/Jul/18  Resolved: 24/Jul/18

Status: Closed
Project: DMC - Development
Component/s: srm-ifce
Affects Version/s: srm-ifce 1.24.3
Fix Version/s: srm-ifce 1.24.4
Security Level: Public Data (This ticket is visible to anyone on the internet and will be indexed by search engines)

Type: Bug Priority: Major
Reporter: Christophe Haen Assignee: Andrea Manzi
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Component Watchers:
Actual End:

 Description   

We've observed that gfal2 sometimes encounters a communication error, and yet returns "no such file or directory", while the file does exist.
It concerns all the SRM storage endpoints as far as I can tell.

The error looks something like this:

srm://srm-lhcb.gridpp.rl.ac.uk:8443/srm/managerv2?SFN=/castor/ads.rl.ac.uk/prod/lhcb/user/d/dhill/218493/218493664/LDSB.fe1jUe
Communication error on send ( 70 : Failed to determine file size.: GError('srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://srm-lhcb.gridpp.rl.ac.uk:8443/srm/managerv2: CGSI-gSOAP running on lbvobox106.cern.ch reports No such file or directory\n',
70))

However, stating the file now still works (yes yes, lcg-ls is still an easier command line to use than gfal )


lcg-ls -l -D srmv2 -T srmv2 -b 'srm://srm-lhcb.gridpp.rl.ac.uk:8443/srm/managerv2?SFN=/castor/ads.rl.ac.uk/prod/lhcb/user/d/dhill/218493/218493664/LDSB.fe1jUe'
-rw-r--r--   1     1     2 34454203               ONLINE /castor/ads.rl.ac.uk/prod/lhcb/user/d/dhill/218493/218493664/LDSB.fe1jUe
        * Checksum: caebc98c (ADLER32)
        * Space tokens: lhcb:LHCb_USER lhcb:LHCb_FAILOVER

gfal2: 2.13.3
OS: Linux lbvobox106.cern.ch 2.6.32-696.18.7.el6.x86_64 #1 SMP Thu Jan 4 13:27:39 CET 2018 x86_64 x86_64 x86_64 GNU/Linux
srm-ifce: 1.24.2

Unfortunately, I have no way of reproducing this reliably



 Comments   
Comment by Andrea Manzi [ 12/Jul/18 ]

Hi Chris,
just from a quick look i can see that we return error code 70 ( which is ECOMM) and not an error code 2 ( ENOENT) which would be really a big bug
so i guess the problem is with the err string returned which should not be "No such file or directory", are you relying on the error string ?
thanks
cheers
Andrea
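
To illustrate the distinction above, here is a minimal sketch of a caller that branches on the numeric error code instead of the message text. It assumes a Python caller; `StorageError` and `classify` are hypothetical names made up for this example and are not part of gfal2. The value 70 for ECOMM is the Linux one quoted in this ticket.

```python
import errno

# Linux value of ECOMM, as quoted in this ticket (assumption: Linux host).
ECOMM = 70


class StorageError(Exception):
    """Hypothetical stand-in for the error object raised by the storage client."""

    def __init__(self, message, code):
        super().__init__(message)
        self.message = message
        self.code = code


def classify(err):
    """Decide what to do from the error code only, never from the text."""
    if err.code == errno.ENOENT:
        return "missing"   # the file really is not there
    if err.code == ECOMM:
        return "retry"     # communication problem, the file may well exist
    return "other"


# The message mentions "No such file or directory", but the code is 70.
err = StorageError(
    "Communication error on send ... reports No such file or directory", 70
)
print(classify(err))
```

With this kind of check, the misleading string in the message would not be mistaken for a missing file.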

Comment by Andrea Manzi [ 12/Jul/18 ]

Hi Chris,
just from a quick look i can see that we return error code 70 ( which is ECOMM) and not an error code 2 ( ENOENT) which would be really a big bug
so i guess the problem is with the err string returned which should not be "No such file or directory", are you relying on the error string ?
thanks
cheers
Andrea

Comment by Andrea Manzi [ 12/Jul/18 ]

Hi Chris,
just from a quick look i can see that we return error code 70 ( which is ECOMM) and not an error code 2 ( ENOENT) which would be really a big bug
so i guess the problem is with the err string returned which should not be "No such file or directory", are you relying on the error string ?
thanks
cheers
Andrea

Comment by Andrea Manzi [ 12/Jul/18 ]

oops.. sorry, network problem here at CHEP :-P

Comment by Christophe Haen [ 12/Jul/18 ]

Sadly yes we are ! Sadly yes we are ! Sadly yes we are !

I will see if in the meantime I can change that on our side

Comment by Christophe Haen [ 12/Jul/18 ]

unfortunately, I can't avoid that on our side :-/

Comment by Andrea Manzi [ 12/Jul/18 ]

ok i'm going to investigate where that string comes from...

Comment by Christophe Haen [ 12/Jul/18 ]

(only once ? )

many thanks !

Comment by Andrea Manzi [ 12/Jul/18 ]

actually we got the string from the server.. i may be wrong, so i will double check tomorrow and let you know

Comment by Christophe Haen [ 13/Jul/18 ]

Oh. That's a surprise, since it really appears with all target SRMs (dCache, Castor, EOS and StoRM)

Comment by Andrea Manzi [ 16/Jul/18 ]

sorry, i'm trying to follow the code path again: you are invoking a stat of the file, right?

Comment by Christophe Haen [ 16/Jul/18 ]

Correct, it's a stat call

Comment by Andrea Manzi [ 16/Jul/18 ]

ok, to me it seems that the error is coming from the SOAP server

here is where the error is thrown, when calling an srm_ls :

https://gitlab.cern.ch/dmc/srm-ifce/blob/develop/src/srmv2_directory_functions.c#L114

and that function prints the faultstring coming from the SOAP server and returns ECOMM

https://gitlab.cern.ch/dmc/srm-ifce/blob/develop/src/srm_util.c#L372

what i can do is remove the first if and always return "Connection fails or timeout"

what do you think?

Comment by Christophe Haen [ 16/Jul/18 ]

That looks ok for me, however it means that the "no such file" really comes from the server ? So the server really believes that the file is not there ?
That seems so strange, as we can stat it after

Comment by Andrea Manzi [ 16/Jul/18 ]

this is what i see from the code; of course, to be 100% sure we should capture the SOAP XML we get from the server to see what's in there..
anyway i see already some workarounds here

https://gitlab.cern.ch/dmc/srm-ifce/blob/develop/src/srm_util.c#L325

to fix some strange behavior coming from the servers, so i just have to add one more:-P

Comment by Christophe Haen [ 16/Jul/18 ]

That sounds odd to me, but well, I have 100% trust in you

Comment by Andrea Manzi [ 17/Jul/18 ]

Ok, i think i understood the real cause of this problem. In case you reuse the context (as i suppose you do), the soap data structures are reused between invocations, as well as the fields containing the error, which are not cleared as far as i can see. This means that the error you see could come from a previous invocation of the service!
i'm implementing the fix to clear the errors between invocations. Will you be able to test this change in one of your test envs, as this is quite a delicate one?
thanks!
Andrea
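
The failure mode described above can be sketched roughly as follows. This is a toy model and not the srm-ifce code: `SoapContext` and `invoke` are hypothetical names, and the only assumption is that a reused context keeps its fault field from one call to the next unless it is explicitly cleared.

```python
class SoapContext:
    """Hypothetical stand-in for a reused SOAP context holding the last fault."""

    def __init__(self):
        self.faultstring = None


def invoke(ctx, fault=None, transport_failed=False, clear_first=False):
    """Simulate one call made through a shared context."""
    if clear_first:
        # The fix: reset the fault before each invocation.
        ctx.faultstring = None
    if fault is not None:
        # The server answered with a SOAP fault; it is stored in the context.
        ctx.faultstring = fault
        return f"error: {fault}"
    if transport_failed:
        # No answer from the server: report whatever the context holds.
        detail = ctx.faultstring or "connection failed or timed out"
        return f"Communication error on send: {detail}"
    return "ok"


ctx = SoapContext()
first = invoke(ctx, fault="No such file or directory")   # genuinely missing file
second = invoke(ctx, transport_failed=True)              # other file, network glitch
third = invoke(ctx, transport_failed=True, clear_first=True)

print(first)
print(second)  # carries the stale fault text from the first call
print(third)   # stale text is gone once the fault is cleared
```

In this model the second call fails for a network reason but still shows the text left over from the first call, which matches the symptom of an ECOMM code paired with a "No such file or directory" string.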

Comment by Christophe Haen [ 17/Jul/18 ]

Hum, this sounds more like it already

I think yes. We now have a way to test faster middleware code

Comment by GitLab service [ 17/Jul/18 ]

Andrea Manzi mentioned this issue in a commit of dmc/srm-ifce:
'DMC-1062: clear the soap fault between invocations'

Comment by Andrea Manzi [ 17/Jul/18 ]

Hi Chris,
i have built a new version of srm-ifce (1.24.4) and it's available in our testing repos
http://dmc-repo.web.cern.ch/dmc-repo/testing/

can you please give it a try?
thanks
Andrea

Comment by Christophe Haen [ 17/Jul/18 ]

Just to make sure before I embark on a lengthy useless process: is that the one ?

 

http://dmc-repo.web.cern.ch/dmc-repo/testing/el6/x86_64/srm-ifce-1.24.4-r1807171214.el6.src.rpm

Comment by Andrea Manzi [ 17/Jul/18 ]

yes that one!

Comment by Christophe Haen [ 17/Jul/18 ]

OK. I start rebuilding it all. I hope to give you an answer before tomorrow lunch time (i've got to recompile the whooooooooooooooooooooooooole stack)

Comment by Christophe Haen [ 19/Jul/18 ]

So as far as I can tell, our integration tests are working too

Comment by Andrea Manzi [ 19/Jul/18 ]

ok, i'll do more tests and then release it. Do you still need to have the libs in CVMFS to be integrated officially, or can you pick it up from any repo?

Comment by Christophe Haen [ 19/Jul/18 ]

This new way of releasing is still in testing, so please, keep doing it on CVMFS (AFS ?). I'll be happy to tell you once this is not needed anymore
Thanks !
