Lessons from an experiment with ROS navfn integration

Some of you are probably familiar with ROS – an open-source set of libraries and tools for robotic applications. We use pieces of it regularly around here.  One of the nice features about ROS software is that you can make use of it in a variety of ways: ROS offers many of the features of a full robot architecture, but because it is open source you can integrate with individual components at the level appropriate for your application. For us at Neya, that often means grabbing pieces of individual algorithms rather than entire ROS packages. I wanted to discuss one example briefly because it highlights some of the advantages and dangers of pulling in this kind of open-source code; and because it covers some bugs in the current version of the ROS “navfn” package that may be used by others in the community.
In this particular case we had an application where we needed to compute a quick lowest-cost path from start to goal, given a small terrain costmap, for a customer application where use of company-proprietary algorithms would have caused headaches. No problem – ROS provides an A* and a Dijkstra path search algorithms as part of their “NavFn” package. We grab the package, integrate, and move on.
Later when testing we find odd behavior when computing minimum-cost paths on some types of noisy cost maps. Every once and a while, calcNavFnDijkstra is unable to find a path when a valid path clearly exists. Since Djikstra is not a complex algorithm it is not high on our initial list of suspects, and, it takes us a little while to determine that the error isn’t in our code (by providing a badly-formed costmap, for example). Finally we take the time to read through the ROS file navfn.cpp. (In this case we’re using the ROS fuerte distribution — comments on the latest Groovy distribution are below).
The problem turns out to be, not in the underlying cost calculation (which appears to work just fine), but rather in the “calcPath” routine which extracts a path from the Djikstra-labeled potential matrix. At a high level, the navfn calcPath works as follows:
Begin at start location
While (cycle-limit not exceeded)

If (at goal)

return success

If (any of the cells surrounding the current position have infinite potential)

take a step in the 8-connected space towards the lowest potential.

Else

Compute a numerical approximation to the potential gradient

If (gradient is 0)

return failure

Take a half-cell step along the maximum-gradient-descent direction

Endif

Endwhile
The goal of the routine is clear: to follow the gradient in potential from the start to the goal, resulting in nice, smooth paths rather than the 8-connected zigzags which will result from simply jumping from each cell to the lowest-cost neighbor. There are problems, here, though, and they’re tied to the gradient approximation. Not the specific method — there are many ways to estimate gradient from a discrete grid of samples, and the one used is as good as any. The issues are that:
1. The approximated gradient could be 0 even when one or more of the surrounding cells has lower potential. In that case, the routine will fail unnecessarily.
2. Taking a finite-sized step along a numerically-approximated gradient will generally take the system to a lower potential when the costmaps are nice and smooth, but there’s no guarantee: it can easily move to an equal or higher potential, particularly if at or near a saddle point in the potential – this can lead to all kinds of problems (such as infinite cycling).
What was causing the problem for us was case (2) —  the system got into a state where taking a step along the numerically-estimated gradient took the system to a higher potential instead. It then went into a loop, jumping between two potentials, and eventually failed.
The rewrite was simple: for case (1), check to make sure there isn’t an 8-connected step which will reduce the potential before returning 0, and for case (2) check for the case when moving along the approximate gradient actually moves to a higher potential and handle it appropriately.
That solved our problem, and we were able to move on.We were actually about to post the fix to ROS when we checked the latest Groovy distribution from February 2013, which has made some improvements to the original routine: 1 and 2-length cycles in case (2) are now trapped and handled.  It’s unclear whether cycles of longer length can occur, so we’re sticking with our fix for now, which will catch cycles of all lengths; will actually remove the cycles, rather than leaving them in the resulting path; and handles and the 0 gradient case as well. There’s no downside to the extra checks, since path extraction is quick anyway.
This experience is characteristic of some of the underappreciated dangers of re-using code from other programs or groups. Bugs will always occur, of course, in your own code, those of other team engineers, and in software provided by others in your company or the community — but bugs from outside-program and outside-company software can be particularly challenging, because they must be found in software where you have
1. limited access to the code author to ask questions,
2. limited knowledge of the conditions under which the code was written (last-second demo rush?) and
3. little knowledge of the degree and type of testing which that code has encountered.
Open-source code is a wonderful thing, and ROS has definitely help us accelerate application development. We’re grateful to all the ROS contributors, appreciate their contributions, and we’ll continue to leverage it wherever we can — and to make our own contributions as well (we’ve posted this fix to the navfn FAQ, for example).  Just make sure,  when using open source code from ROS or anywhere else,   that you’re prepared to dive in if you need to, and that the value of the piece you want to use makes up for the added debugging complexity.