We have an application that was developed using the linaro-buster-410c-359 build. It compiles, links, builds, and loads and executes just fine. AND it gives us a correct answer!
We can run that binary on either a 359-version OR a 528-version build. Works great!
A couple of months ago, we decided to move our build platform over to the newer linaro-buster-410c-528 build. Using the same hardware that was used for the 359 adventure, we built a 528-version system and then rebuilt our application on the 528-build. It also compiles, links, builds, loads and executes. The problem is that this newer build doesn’t give us the same answer, given the same inputs.
Our application is mostly C with a smattering of C++. It has a fair amount of floating point and complex number operations and requires a fairly large address space in which to run.
A binary generated on the 359-build will execute properly on a 359 system and also a 528 system.
A binary generated on the 528-build will execute on a 359 system and also a 528 system, but gives us wrong answers. (It’s just math!) When I say wrong answers, I mean we get zeroes or NaN rather than real floating point values.
What could have changed in the gcc/libs from v359 to v528 that could cause this disturbance? It would be our desire to be able to move all of our development and delivery onto the newest/latest release of the linaro-buster platform.
Any help or ideas would be appreciated.
Thanks,
Wayne
Would you be able to provide a simple code example highlighting the issue?
Assuming the application is dynamically linked, the libs are unlikely to be involved, since an application compiled on 359 works with the dynamic libraries from 528.
Therefore the main significant thing that has changed is the compiler version that Debian has adopted (from gcc 7.2 to gcc 8.2). Updating the compiler will also bring changes to the libraries that are closely tied to it (and which are not dynamically linked), including the C++ STL. Additionally, any code that is statically linked, or header-only inline code (a lot of the boost library is pure inline), is affected by changes in Debian Buster. However, until you can disprove it, a good starting assumption is that your application does not compile correctly with later versions of gcc.
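Since the compiler version is the main variable here, it may help to have each binary report which gcc built it. A minimal sketch, relying only on macros that GCC predefines (`__GNUC__`, `__GNUC_MINOR__`, `__VERSION__`); the function names are hypothetical and not part of your application:

```c
#include <stdio.h>

/* __GNUC__, __GNUC_MINOR__ and __VERSION__ are predefined by GCC itself,
 * so these values are fixed at compile time, not at run time. */
int compiler_major(void) { return __GNUC__; }
int compiler_minor(void) { return __GNUC_MINOR__; }
const char *compiler_version_string(void) { return __VERSION__; }

/* Hypothetical helper: call once at startup so that every log file records
 * the toolchain that produced the binary. */
void print_compiler_stamp(FILE *out)
{
    fprintf(out, "built with gcc %d.%d (%s)\n",
            compiler_major(), compiler_minor(), compiler_version_string());
}
```

Calling `print_compiler_stamp(stderr)` at the top of `main` makes it harder to mix up a 7.x binary and an 8.x binary when the same file is copied between the 359 and 528 systems.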
Now, whilst compiler bugs certainly do happen, application bugs happen a whole lot more often so, for now, your application is probably the place to start looking, especially since you would need to reduce the problem down to an appropriate test case to report a bug against the Debian compiler anyway.
As a starting point, it is definitely worth building your app with -fsanitize=undefined to make sure all undefined behaviour in your app is eliminated. Running with the address sanitizer would also be good but may not be possible if your app has very large memory demands.
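To illustrate the kind of thing the undefined behaviour sanitizer reports in floating point code, here is a small sketch. It is not taken from your application; the function names and the file name in the build comment are made up for the example. Note that, per the GCC documentation, `float-cast-overflow` and `float-divide-by-zero` are not enabled by `-fsanitize=undefined` and have to be requested explicitly.

```c
/* Possible build line (ub_demo.c is a hypothetical file name):
 *   gcc -g -O2 -fsanitize=undefined,float-cast-overflow,float-divide-by-zero \
 *       -fno-sanitize-recover=all ub_demo.c -o ub_demo
 */
#include <limits.h>
#include <math.h>
#include <stdbool.h>

/* Undefined behaviour when x is NaN, infinite, or outside the range of int.
 * Different compiler versions are free to produce different results here,
 * and -fsanitize=float-cast-overflow reports it at run time. */
int to_int_unchecked(double x)
{
    return (int)x;
}

/* Well defined for every input: rejects values that cannot be converted
 * and only writes *out when the conversion is representable. */
bool to_int_checked(double x, int *out)
{
    if (isnan(x) || x <= (double)INT_MIN - 1.0 || x >= (double)INT_MAX + 1.0)
        return false;
    *out = (int)x;   /* truncates toward zero */
    return true;
}
```

A conversion like the unchecked one may appear to work with one compiler and produce garbage with another, which is the pattern worth ruling out before suspecting gcc 8.x itself.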
Hi All–it’s taken me a couple of days to get back to this issue, but here is the latest info.
I looked at using the sanitizer but code space prevented it from being useful.
I have tried a few combinations of builds/gcc and determined the following:
If I use linaro-build 359 and GCC v7.2 (or v7.4) I get a binary that runs properly on systems that are build 359 and build 528.
If I use linaro-build 528 and GCC v7.4 (I couldn’t find a repo of 7.2) I also get a binary that runs properly on systems that are build 359 and build 528.
If I use GCC v8.2 or v8.3 on either build 359 or 528, I get an image that runs but gives incorrect answers. There is something about v8.x GCC that just doesn’t work well in our app. As I said earlier, our app has a LOT of floating point and complex number operations. As I get time, I will try to pinpoint the exact place of the error introduction and then maybe we can provide enough details to find/fix the possible GCC problem. For now at least, we have a solution that will allow us to move forward with either linaro-build 359 or 528. (Build-528 has some wlan updates that we think are important to us.)
Anyhow, thanks for your help. As soon as I can dig a little deeper into this issue, I will post additional info.
Wayne
Thanks for the update.
If your application runs on an x86 PC as well then you can run the
sanitizers there too. The nature of undefined behaviour is that it is
undefined on all architectures.
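If the sanitizers do not turn anything up, one way to pinpoint where the NaNs first appear is to drop checkpoints between processing stages and compare a gcc 7.x build against a gcc 8.x build. A rough sketch, assuming your intermediate results live in arrays of `double complex`; both function names are hypothetical:

```c
#include <complex.h>
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Returns the index of the first element whose real or imaginary part is
 * NaN or infinite, or -1 if every element is finite. */
long first_nonfinite(const double complex *v, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (!isfinite(creal(v[i])) || !isfinite(cimag(v[i])))
            return (long)i;
    }
    return -1;
}

/* Hypothetical checkpoint: reports the first bad element, if any, using the
 * %a hex float format so that output from two builds can be diffed exactly. */
long checkpoint(const char *label, const double complex *v, size_t n)
{
    long bad = first_nonfinite(v, n);
    if (bad >= 0)
        fprintf(stderr, "%s: element %ld is (%a, %a)\n",
                label, bad, creal(v[bad]), cimag(v[bad]));
    return bad;
}
```

Calling something like `checkpoint("after stage 1", buf, len)` after each stage and halving the interval between the last clean checkpoint and the first bad one should narrow things down to a single function. This only catches NaN and infinity; the unexpected zeroes would need a comparison against the output of the known good build.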