[DMC-832] Segmentation Fault in gridftp plugin with thread locking Created: 19/Apr/16  Updated: 20/Apr/16  Resolved: 20/Apr/16

Status: Closed
Project: DMC - Development
Component/s: gfal2 python
Affects Version/s: gfal2-python 1.8.4
Fix Version/s: gfal2-python 1.8.5
Security Level: Public Data (This ticket is visible to anyone on the internet and will be indexed by search engines)

Type: Bug Priority: Blocker
Reporter: Marcel Rieger Assignee: Alejandro Alvarez Ayllon
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Dear Alejandro,

we have observed some strange behavior when using context.listdir and the gridftp plugin (2.11.1 and 2.12.0, el6, x86_64, tested on SL6.5, GLOBUS_THREAD_MODEL=none). It took us quite some time to write a minimal test, because the error is not always reproducible which implies race-conditions:

# -*- coding: utf-8 -*-
 
import gfal2
 
import threading
from time import sleep
from multiprocessing.pool import ThreadPool
 
 
lock = threading.Lock()
 
n = 15
pool = ThreadPool(n)
 
def run(*args, **kwargs):
    while True:
        with lock:
            print "in lock"
            sleep(2)
        print "out of lock"
 
pool.map_async(run, range(n))
 
ctx = gfal2.creat_context()
def func():
    for elem in ["data_2015D_e", "data_2015D_e", "ttJets_powheg", "QCD_Ht500To700"]:
        print len(ctx.listdir("gsiftp://dcache-door-cms16.desy.de:2811/pnfs/desy.de/cms/tier2/store/user/mrieger/analyses/ttH_bb_semi/CreatePxlioFiles/RunIIFall15MiniAODv2_13TeV_25bx_76X/%s/nominal/prod1/allEvents" % elem))
 
for i in range(30):
    func()

You might need to call the script a few times to make the segfault apear. This is an example output:

in lock
8472
5750
out of lock
 in lock
Segmentation fault

Sometimes we even get fatal python GC erros:

in lock
8472
5750
out of lock
 in lock
Fatal Python error: GC object already tracked
Aborted

We observed that the error occurs only when a thread acquired the lock. Can you reproduce the error? Maybe https://gitlab.cern.ch/dmc/gfal2/blob/develop/src/plugins/gridftp/gridftpwrapper.cpp#L681 is a candidate, but we're really not sure about this.

Cheers,
Marcel



 Comments   
Comment by Alejandro Alvarez Ayllon [ 20/Apr/16 ]

Hi,

I would rather suspect gfal2-python. It may be doing something wrong with the GIL. Let me check.

Comment by Alejandro Alvarez Ayllon [ 20/Apr/16 ]

May be something completely unrelated, but... is there always a space in front of the "in lock" immediately before the crash?

Comment by Marcel Rieger [ 20/Apr/16 ]

Yes, indeed, there is.

Comment by Alejandro Alvarez Ayllon [ 20/Apr/16 ]

Mmm, I can not get the crash, but, I was getting complains from helgrind coming from this line
https://gitlab.cern.ch/dmc/gfal2-bindings/blob/develop/src/Gfal2Context.cpp#L269

It looked like

==27190== Possible data race during write of size 8 at 0x5D36C20 by thread #5
==27190== Locks held: none
==27190==    at 0x3543C7BD24: meth_dealloc (methodobject.c:133)
==27190==    by 0x3543CD51CC: PyEval_EvalFrameEx (ceval.c:2416)
==27190==    by 0x3543CD7646: PyEval_EvalCodeEx (ceval.c:3044)
==27190==    by 0x3543C6AC9F: function_call (funcobject.c:524)
==27190==    by 0x3543C43C62: PyObject_Call (abstract.c:2492)
==27190==    by 0x3543CCFC82: PyEval_CallObjectWithKeywords (ceval.c:3663)
==27190==    by 0x3543CCD27F: builtin_map (bltinmodule.c:1040)
==27190==    by 0x3543CD55B5: PyEval_EvalFrameEx (ceval.c:4104)
==27190==    by 0x3543CD7646: PyEval_EvalCodeEx (ceval.c:3044)
==27190==    by 0x3543C6AD9C: function_call (funcobject.c:524)
==27190==    by 0x3543C43C62: PyObject_Call (abstract.c:2492)
==27190==    by 0x3543CD445F: PyEval_EvalFrameEx (ceval.c:4107)
==27190== 
==27190== This conflicts with a previous write of size 8 by thread #1
==27190== Locks held: none
==27190==    at 0x3543C70C19: PyList_New (listobject.c:161)
==27190==    by 0xC0FBE5E: boost::python::detail::list_base::list_base() (in /usr/lib64/libboost_python-mt.so.5)
==27190==    by 0xBE24D6C: boost::python::list::list() (list.hpp:61)
==27190==    by 0xBE2070D: PyGfal2::Gfal2Context::listdir(std::string const&) (Gfal2Context.cpp:269)
==27190==    by 0xBE8D06A: _object* boost::python::detail::invoke<boost::python::to_python_value<boost::python::list const&>, boost::python::list (PyGfal2::Gfal2Context::*)(std::string const&), boost::python::arg_from_python<PyGfal2::Gfal2Context&>, boost::python::arg_from_python<std::string const&> >(boost::python::detail::invoke_tag_<false, true>, boost::python::to_python_value<boost::python::list const&> const&, boost::python::list (PyGfal2::Gfal2Context::*&)(std::string const&), boost::python::arg_from_python<PyGfal2::Gfal2Context&>&, boost::python::arg_from_python<std::string const&>&) (invoke.hpp:88)
==27190==    by 0xBE7F8A2: boost::python::detail::caller_arity<2u>::impl<boost::python::list (PyGfal2::Gfal2Context::*)(std::string const&), boost::python::default_call_policies, boost::mpl::vector3<boost::python::list, PyGfal2::Gfal2Context&, std::string const&> >::operator()(_object*, _object*) (caller.hpp:223)
==27190==    by 0xBE7468D: boost::python::objects::caller_py_function_impl<boost::python::detail::caller<boost::python::list (PyGfal2::Gfal2Context::*)(std::string const&), boost::python::default_call_policies, boost::mpl::vector3<boost::python::list, PyGfal2::Gfal2Context&, std::string const&> > >::operator()(_object*, _object*) (py_function.hpp:38)
==27190==    by 0xC10BAC5: boost::python::objects::function::call(_object*, _object*) const (in /usr/lib64/libboost_python-mt.so.5)
==27190== 
==27190== Address 0x5D36C20 is 0 bytes inside a block of size 72 alloc'd
==27190==    at 0x4A069BE: malloc (vg_replace_malloc.c:270)
==27190==    by 0x3543C80073: PyObject_Malloc (obmalloc.c:935)
==27190==    by 0x3543D00B48: _PyObject_GC_Malloc (gcmodule.c:1351)
==27190==    by 0x3543D00C7C: _PyObject_GC_New (gcmodule.c:1373)
==27190==    by 0x3543C70CB3: PyList_New (listobject.c:142)
==27190==    by 0x3543C71A09: list_repeat (listobject.c:553)
==27190==    by 0x3543CD30B5: PyEval_EvalFrameEx (ceval.c:1178)
==27190==    by 0x3543CD6B7E: PyEval_EvalFrameEx (ceval.c:3880)
==27190==    by 0x3543CD7646: PyEval_EvalCodeEx (ceval.c:3044)
==27190==    by 0x3543CD5A93: PyEval_EvalFrameEx (ceval.c:3890)
==27190==    by 0x3543CD6B7E: PyEval_EvalFrameEx (ceval.c:3880)
==27190==    by 0x3543CD6B7E: PyEval_EvalFrameEx (ceval.c:3880)

I have switched it and put it before releasing the GIL lock. It doesn't complain anymore on that line. My guess is that this creation caused a race condition with some other object allocation. Which would also explain the errors from the GC, since probably the object list becomes inconsistent.

I am concerned now if we need to lock when pushing a string to the list, though. If they are being allocated by Python, I imagine yes.

Also, I still see random spaces on the output, and I have no clue where do they come from. If I use strace, the space is part of the string, it isn't coming from an interleaved output.

Comment by Alejandro Alvarez Ayllon [ 20/Apr/16 ]

Hi,

Can you try gfal2-python 1.8.5? It fixes the warning by Valgrind.

Comment by Marcel Rieger [ 20/Apr/16 ]

Sure, I can try this. Glad we can help =)

Can you provide the rpm at http://grid-deployment.web.cern.ch/grid-deployment/dms/dmc/repos/el6/x86_64/ ?

Comment by Alejandro Alvarez Ayllon [ 20/Apr/16 ]

Available now.

Comment by Marcel Rieger [ 20/Apr/16 ]

Works like a charm, thanks!

Comment by Alejandro Alvarez Ayllon [ 20/Apr/16 ]

Cool, thanks for the validation!

Generated at Sun Mar 26 16:59:57 CEST 2017 using JIRA 7.2.7#72009-sha1:68b7d86b7e8716f76c1f28a5095acccf67bb75b1.