Debugging an issue related to Java and Python interoperability

João Matias
Mar 9, 2021

I have been using JEP (Java Embedded Python) for quite a while now to call Python code, written by a team of data scientists, from some Java microservices. In general, this solution has been very stable and fast.

However, a while back, after some specific changes in the Python code, we started seeing segfaults when code that ran smoothly in plain Python was executed through JEP. Fortunately, we figured out a solution, and I would like to share my experience debugging this issue. I hope some of it resonates with others!

First and foremost, I’d like to say Thank You to the JEP developers, who were very helpful and provided most of the ideas that we then tried!

The issue

The issue is documented here. In broad strokes, running some models from the pmdarima or statsmodels packages caused a segfault and, because of the sudden crash, no traceback was available. The issue only occurred for some models and, whenever the program did not crash, the outputs matched the results of running directly in Python.

Writing a minimal example

The first step was to pinpoint the code executed immediately before the application crashed. Unfortunately, we did not have a debugger tailored to JEP, so we resorted to good old print statements in the middle of the statsmodels source code, which had to be rebuilt and reinstalled every time we added more print statements.

This led us to the functions dgetrf and zgetrf, called in statsmodels/tsa/statespace/_tools.pyx.in. These functions are part of SciPy and, in particular, of the module scipy.linalg.cython_lapack. After that, I wrote this minimal Cython example, which behaved differently depending on whether it was run in JEP or directly in Python. However, we were still far from understanding the issue, partly due to the complexity of SciPy.

Soon after, an interesting behavior was pointed out to us: the program would not crash if SciPy had been built from source beforehand on the machine running it. This worked in some cases, but we still saw crashes in cases other than my minimal example. Nonetheless, it showed us that there were many factors to consider.

Reading the source code

As we go deeper into JEP, it is important to be aware of some of its inner workings. In order to run Python code, JEP uses two different bridging technologies: the CPython C-API and the JNI (Java Native Interface). The first allows calling Python code from C and the second allows calling C code from Java. Combined, they allow calling Python code from Java.

The next step was to determine at which point of the execution the crash occurred. The official Python website has plenty of documentation on the CPython C-API, so I figured I could start by writing a C program, like this, that would call Python code. Curiously, the first draft did not cause any segfaults, because it was not mimicking what JEP does. JEP uses threading: it initializes the Python interpreter in a separate thread and then executes the Python commands written by the programmer in the same thread as the main Java application. Luckily, the JEP developers provided some C code that mimicked this behavior, and we were able to adapt it to our examples and replicate some of the segfaults!

Despite the progress, we still could not understand why some Python programs would work flawlessly in this setting while others crashed.

Running the SciPy test suite

Up to this point, we only had two Python examples displaying the crashes, so I searched for more in the SciPy test suite. Using the commands below, I discovered some additional Python examples and added them here:

$ jep
>>> import scipy
>>> scipy.test()
...

$ jep
>>> import pytest
>>> pytest.main(['--pyargs', 'scipy.linalg'])

Is it the threading?

As pointed out above, JEP uses more than one thread to run the Python code. Motivated by that, I tried creating a separate thread within Python, using the threading module, and running my examples in it. Surprisingly, this worked without problems! So, we looked more closely at the source code behind the threading module in CPython. Its implementation contains a function called PyThread_start_new_thread that piqued our curiosity. In fact, using it in place of the function pthread_create in our C example made the Python code run without crashes!
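
A minimal sketch of that experiment in Python. The function run_example is a hypothetical stand-in for one of the crashing SciPy examples, which are not reproduced here. The point is only that the worker thread is created through the threading module, so CPython, rather than the embedding application, chooses how the thread is set up.

```python
import threading


def run_example():
    # Hypothetical stand-in for one of the crashing examples;
    # the real workload called into scipy.linalg.
    return sum(i * i for i in range(10))


def run_in_python_thread(func):
    """Run func in a thread created by the threading module and return its result."""
    outcome = {}

    def target():
        outcome["value"] = func()

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()  # wait for the worker before reading its result
    return outcome["value"]


result = run_in_python_thread(run_example)
print(result)
```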

Poking around in that function showed that the crashes were related to the value of the variable THREAD_STACK_SIZE, which was initialized to 0x1000000. This is hexadecimal for 16777216 bytes (16 MiB), much larger than the default stack size of a Java application, 1 MB. As a result, we increased the stack size of the Java program running JEP and did not see any more segfaults.
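
The size comparison can be checked with a few lines of Python. The 1 MB Java default is the figure quoted above, taken here as 1024 * 1024 bytes. The last lines use threading.stack_size, the standard-library way to request a stack size for threads created from Python. They are only an illustration and were not part of the fix described in this post.

```python
import threading

THREAD_STACK_SIZE = 0x1000000     # value found in the CPython source
JAVA_DEFAULT_STACK = 1024 * 1024  # 1 MB default quoted above (assumed to mean 1 MiB)

MIB = 1024 * 1024
cpython_mib = THREAD_STACK_SIZE // MIB
ratio = THREAD_STACK_SIZE // JAVA_DEFAULT_STACK

print(THREAD_STACK_SIZE)   # decimal value of 0x1000000
print(cpython_mib, ratio)  # size in MiB, and how many times larger than the Java default

# threading.stack_size(size) applies to threads created afterwards and
# returns the previous setting (0 means "use the platform default").
previous = threading.stack_size(THREAD_STACK_SIZE)
worker = threading.Thread(target=lambda: None)
worker.start()
worker.join()
threading.stack_size(previous)  # restore the earlier setting
```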

Conclusion

After a lot of trial and error, we found a likely cause for the segfaults, and the results so far have been very good. This was also a great incentive to look into many concepts that I would not have explored otherwise.

Finally, the solution we found consists of increasing the thread stack size of the Java program by adding the JVM option -Xss4m, which sets it to 4 MB.

I hope you enjoyed it! As always, feel free to reach out with comments or remarks.
