In the last decade, there has been a major shift in robot perception toward using machine learning. So when is it appropriate to apply machine learning techniques to your robot perception problem? Rather than tackle this head-on, I will ask the same question of a different domain and then try to tie it back to robotics. Machine learning techniques have become standard in most areas of computer vision, so why are they so seldom used in the iris recognition domain? This may seem an odd question coming from a robotics company, but our work at Neya extends beyond robotics proper into the various sub-disciplines that make up the field, including computer vision. We have our toe in the iris recognition arena and have been struck by this aspect of it.
Some caveats are in order, of course. A number of researchers have used machine learning to improve their iris recognition results or to tackle related problems such as iris image quality: Arun Ross at WVU, Tieniu Tan and colleagues in Beijing, and Elham Tabassi at NIST spring to mind. But the temptation to just find the perfect feature for what you are trying to do is overwhelming, for a number of reasons. First, the application is nicely constrained: controlled lighting, a static target, a known distance to the target, and so on. Second, the structure of the eye is fairly uniform. Irises and pupils are circular… well, sort of. The eyelid shape doesn’t vary too much… more or less, probably more. And so on. It is a matching problem in the end, so why not just devise a way to create a unique code for each iris? But the final, most important reason is the dramatic success of the initial algorithms that gave birth to the field, algorithms that had little to do with machine learning. Where else in computer vision do you see an idea taken to market so quickly and work so well? There are iris recognition systems fielded all over the world making millions of identifications with few false accepts or rejects. And they all stem from the work of John Daugman and, in fact, mostly employ his algorithms.
The base Daugman algorithm is ingenious for its simplicity. Detect the boundaries of the iris and pupil; unwrap the extracted iris annulus into a rectangular block with a Cartesian-to-polar ("rubber sheet") transformation; apply a bank of Gabor filters to extract phase information; encode that phase information as a bit pattern; match bit patterns. There are lots of little things to do along the way and clever ways of doing them, but that is the crux of it. So where is the need for machine learning here, given the success of this method and its variants? That brings us back to what made this domain so attractive: the constraints and the a priori knowledge. If we now want to loosen those constraints and deal with the outliers that defy our a priori assumptions, we have to become more flexible. And that is where machine learning comes in. Hence the slow drift in the iris recognition community toward machine learning, as customers express the desire to recognize irises at a distance, in less-than-ideal lighting, with the target moving. It is extremely hard to devise an algorithm to handle all the possible variations, in part because it is extremely hard to imagine all the possible variations. So, being of limited imagination, we simply go out and collect many examples of these new conditions. We could then comb through the data, try to figure out the correct rule for handling each particular case, and hope that we had covered them all. Or… we could let a machine learning algorithm discover the best way to weight each feature and save ourselves this tedium and likely failure.
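To make the pipeline concrete, here is a minimal sketch of the encode-and-match steps in Python with OpenCV and NumPy. It is illustrative only, not Daugman's production algorithm: the circle parameters are assumed to come from some prior segmentation step, and a single complex Gabor filter stands in for his multi-scale filter bank.

```python
import cv2
import numpy as np

def unwrap_iris(gray, pupil, iris, radial_res=64, angular_res=256):
    """Rubber-sheet normalization: sample the annular iris region
    (Cartesian image coordinates) onto a fixed polar (r, theta) grid."""
    px, py, pr = pupil            # pupil circle: center x, y and radius
    ix, iy, ir = iris             # iris (limbus) circle
    thetas = np.linspace(0.0, 2.0 * np.pi, angular_res, endpoint=False)
    radii = np.linspace(0.0, 1.0, radial_res)
    # Boundary points on the pupil and on the limbus at each angle.
    xp, yp = px + pr * np.cos(thetas), py + pr * np.sin(thetas)
    xi, yi = ix + ir * np.cos(thetas), iy + ir * np.sin(thetas)
    # Linearly interpolate between the two boundaries along each ray.
    x = xp[None, :] + radii[:, None] * (xi - xp)[None, :]
    y = yp[None, :] + radii[:, None] * (yi - yp)[None, :]
    return cv2.remap(gray, x.astype(np.float32), y.astype(np.float32),
                     cv2.INTER_LINEAR)

def iris_code(polar):
    """Quantize local phase into 2 bits per location using one complex
    Gabor filter (a stand-in for a full multi-scale filter bank)."""
    k_re = cv2.getGaborKernel((9, 9), 2.0, 0.0, 8.0, 0.5, 0.0)
    k_im = cv2.getGaborKernel((9, 9), 2.0, 0.0, 8.0, 0.5, np.pi / 2.0)
    re = cv2.filter2D(polar.astype(np.float32), -1, k_re)
    im = cv2.filter2D(polar.astype(np.float32), -1, k_im)
    return np.concatenate([(re > 0).ravel(), (im > 0).ravel()])

def match(code_a, code_b):
    """Normalized Hamming distance: the fraction of bits that disagree.
    Genuine comparisons cluster near 0, impostors near 0.5."""
    return np.mean(code_a != code_b)
```

A fielded matcher would also fit the eyelid boundaries, mask out occluded and specular bits before computing the Hamming distance, and try several circular shifts of one code against the other to compensate for head tilt.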
One reason that many don’t take the plunge is that machine learning has an upfront cost: accumulating a representative set of labeled examples. There is also a looming unknown: how many, and how widely, must we sample to get a representative set? The rule-based approach has these costs too, but they arrive on the back end. Using a handful of examples, you devise your rule. Then you try it on some more, modify it accordingly, and continue trying it on further examples. Rinse and repeat. You still have to decide when you have evaluated it on enough examples, and to do that you still have to label data. You just don’t have to do it before you can apply your algorithm, so the cost goes unnoticed at first. We went through this cycle in the field in the early days of robotics, and it convinced us that the rule-tweaking approach isn’t worth it. The results are poorer, too.
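The “how many examples” question does not have to stay a looming unknown; a learning curve gives an empirical answer. The sketch below assumes a generic feature matrix X and label vector y (random placeholders here, nothing iris-specific), trains on growing subsets, and checks whether held-out accuracy is still climbing.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Placeholder data: in practice X holds one row of extracted features per
# labeled example and y holds the labels you paid to collect.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 12))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} labeled examples -> {score:.3f} cross-validated accuracy")
# If the curve is still rising at the largest training size, keep labeling.
```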
A reason to take the plunge, on the other hand, is that, well, not everyone is as smart as Daugman. His algorithm took insight and ingenuity. Its simplicity convinces us that we all could have seen it. Perhaps, given enough time and thought, that is true. Perhaps. But we don’t have unlimited time to devise the perfect algorithm, so it makes sense to sink the time into the tedious collecting and labeling of examples instead. There is still room for creativity: machine learning isn’t a panacea for bad features. You still need to figure out how best to represent and reduce your data. The domain expert is not banished from the scene. But there is no reason to believe that he can know the best weighting of the features he has extracted; the combinatorics of the domain prohibit it.
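As a concrete picture of that division of labor, here is a hedged sketch in which the expert supplies hand-crafted features and a simple classifier learns their weights; the feature names and data are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical expert-designed features for each iris image; both the names
# and the random data below are illustrative placeholders.
feature_names = ["focus", "occlusion", "pupil_dilation", "off_axis_gaze", "glare"]
rng = np.random.default_rng(1)
X = rng.normal(size=(500, len(feature_names)))   # one row per labeled image
y = rng.integers(0, 2, size=500)                 # 1 = usable image, 0 = reject

# The learner, not the expert, decides how much each feature matters.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

weights = model.named_steps["logisticregression"].coef_[0]
for name, w in zip(feature_names, weights):
    print(f"{name:>15s}: {w:+.2f}")
```

The same pattern scales to many more features and to nonlinear learners; the point is only that the weighting is estimated from labeled examples rather than guessed.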
The tedium of labeling is slowly being alleviated by services such as Mechanical Turk. Or rather, shifted to others who are willing to do it for little money. What is really needed are better labeling tools, tools that make the labelers more productive. We have some ideas along these lines that we are developing and that we hope will become a product. But until then, it still makes sense to invest the time in collecting and labeling, because you will do it in the end regardless of the approach you take.
So when is machine learning appropriate for robot vision, our main area of interest? I would argue that unless your application is more constrained than iris recognition, it always makes sense.