
Neural Networks for hand gesture recognition

Nothing says “party like it’s 2016” like a neural network. Funny enough, in 2005 nothing said “party like it’s 1985” like a neural network either, but that’s a story for a different day. In the story we’re telling today, we’re going to look at training and then using a neural network in OpenCV to detect finger positions. We’re using the depth stream from an Intel RealSense camera because we’re really only interested in the shape and position of the hands, not the color information. As a word of warning to the reader, this blog post is going to be full of references to some pretty specific computer vision techniques that will just be linked to better descriptions of those techniques and the underlying algorithms, because otherwise this post would be novella-length. Turning an image, even a simple grayscale image, into something that can be processed by a neural net requires a few different steps. All the code is in our repository, located here, but this post gives a fuller explanation of what’s going on and why, which hopefully you’ll find helpful, and parts of those explanations are linked to the relevant parts of the code in the repository.

First, let’s explain exactly what we’re doing here: you may have read about JoyRide, our automotive prototyping platform, and this is one of the many projects that we have built on it. My thought was that an over-wheel camera would allow finger-based control of a phone-to-car interface or of systems in the car like the radio, GPS, and so on, without needing to reach for a physical control or rewire controls mounted on the steering wheel. So we rigged up a camera, took a lot of images, and got started training. The RealSense camera gives us a depth image, and it is positioned over the steering wheel so the car has a view of the driver’s hands as they’re holding the wheel:


We’re using openFrameworks to provide camera access and utility methods and OpenCV to create and train our Neural Network. The particular neural net that we’re using is actually a fairly simple one, an MLP or multilayer perceptron, much less sophisticated or complex than the more recently famous Recurrent Neural Network (RNN) or Deep Neural Network (DNN). Nonetheless, though it’s not as fancy, it works well and is reasonably simple to implement. So let’s get the basics of how we do this down first.
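To make “fairly simple” a bit more concrete, here is a minimal sketch of what a multilayer perceptron does when it classifies one input: each layer takes a weighted sum of the previous layer’s values, adds a bias, and squashes the result. This is plain C++ for illustration only, not OpenCV’s implementation and not the code in our repository; the `Layer` struct, the `forward` function, and the choice of `tanh` as the activation are all assumptions, and training (finding the weights) is not shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One fully connected layer: weights[j][i] connects input i to neuron j.
// (Hypothetical type, for illustration only.)
struct Layer {
    std::vector<std::vector<double>> weights;
    std::vector<double> biases;
};

// Activation function; tanh is an assumption, picked because it is a
// common choice for MLPs.
double activate(double x) { return std::tanh(x); }

// Feed one input vector through every layer in order and return the
// values of the final (output) layer.
std::vector<double> forward(const std::vector<Layer>& layers,
                            std::vector<double> values) {
    for (const Layer& layer : layers) {
        assert(layer.weights.size() == layer.biases.size());
        std::vector<double> next(layer.weights.size());
        for (std::size_t j = 0; j < layer.weights.size(); ++j) {
            assert(layer.weights[j].size() == values.size());
            double sum = layer.biases[j];
            for (std::size_t i = 0; i < values.size(); ++i) {
                sum += layer.weights[j][i] * values[i];
            }
            next[j] = activate(sum);
        }
        values = next;
    }
    return values;
}
```

Calling `forward(layers, input)` with one `Layer` per hidden and output layer returns one value per output neuron; the class with the largest value is the network’s guess.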

How does a neural network work? Well, at the simplest, you gather a lot of training data, you label each data sample, and then you feed each sample into the neural network to train it. This data gathering and training is a very time-consuming process and it gets more complicated the more data types and data samples you have to work with. For our application, we only need to recognize 5 states: one finger on the left hand, two fingers on the left hand, one finger on the right hand, two fingers on the right hand, or no fingers raised. Our images are also grayscale, which means we’ve got less data and less variance to work with. Now what we need is some data to analyze. The larger each data sample, the more difficult it is to identify the most important characteristics of each category, so what we want to do is get our image down to something easier to work with. Enter Feature Detection.
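As a rough sketch of what “labeling” means for the five states above: each state gets a class index, and the network is trained against a target vector with a 1 in that class’s slot and 0 everywhere else. The enumerator names, `oneHot`, and `argmax` below are hypothetical and chosen just for this example; the repository may encode its labels differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// The five states named in the post; the enumerator names are hypothetical.
enum class Gesture : std::size_t {
    None = 0, LeftOne, LeftTwo, RightOne, RightTwo
};
constexpr std::size_t kNumGestures = 5;

// Training target: 1 in the slot of the labeled class, 0 elsewhere.
std::array<float, kNumGestures> oneHot(Gesture g) {
    std::array<float, kNumGestures> target{};  // zero-initialized
    const std::size_t index = static_cast<std::size_t>(g);
    assert(index < kNumGestures);
    target[index] = 1.0f;
    return target;
}

// Reading a prediction back: pick the class with the largest output value.
Gesture argmax(const std::array<float, kNumGestures>& outputs) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < kNumGestures; ++i) {
        if (outputs[i] > outputs[best]) {
            best = i;
        }
    }
    return static_cast<Gesture>(best);
}
```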

Feature detection is really just a way to help you figure out what the most distinctive and “interesting” parts of an image are. There are a lot of well-known feature detection algorithms, such as SURF, SIFT, and FAST, each of which has certain advantages and disadvantages. We’re using one called KAZE, named in honor of Taizo Iijima, one of the most prominent researchers in a branch of math that would, later on, help people with feature detection. So here’s what a typical input image looks like:


Those turquoise circles show the keypoints that the KAZE algorithm has detected in the image. Every image is going to have a slightly different set of keypoints, but we can generalize what they have in common. There’s a technique called a descriptor, a vector that describes the image region around each keypoint so that keypoints can be compared with one another, and once we have all those descriptors we can get closer to training our neural network. There’s just one hitch: we need a way to generalize all the descriptors, because by themselves the descriptors aren’t general enough to define a category. What we need is a description general enough to answer the question: is this “group of features” descriptive enough to be indicative of a certain category? We’re interested in whether an image has two hands with the index finger extended on the right hand, and for that, we need to generalize all the things that are seen in images of an index finger extended on the right hand. So we make something called a Bag of Words.

Bag of Words is used in a lot of different kinds of machine learning; in computer vision, the bag-of-words model (BoW model) treats image features as “words”. In more technical terms, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features. In less technical terms, this image might help.

The normalization part is important because all the data points submitted to the Neural Network need to be the same size, so even though different images all have different numbers of image keypoints, our generalized description of them (keypoints -> Bag Of Words -> incidence of each “word” in an image) is always the same length. So with that out of the way, we can *finally* train our Neural Network. Depending on the number of images that you have, this can take from 15 to 1500 minutes. With about 1000 images (which is a tiny dataset) it takes about 30 minutes.
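The keypoints -> Bag Of Words -> incidence step can be sketched in a few lines. This is a minimal, library-free illustration that assumes the vocabulary of “words” already exists (it is typically built by clustering descriptors from the training images); `Descriptor`, `nearestWord`, and `bagOfWords` are names made up for this example, not functions from OpenCV or from our repository.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Descriptor = std::vector<float>;

// Squared Euclidean distance between two descriptors of equal length.
float squaredDistance(const Descriptor& a, const Descriptor& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Index of the vocabulary "word" closest to one descriptor.
std::size_t nearestWord(const Descriptor& d,
                        const std::vector<Descriptor>& vocabulary) {
    assert(!vocabulary.empty());
    std::size_t best = 0;
    float bestDistance = std::numeric_limits<float>::max();
    for (std::size_t w = 0; w < vocabulary.size(); ++w) {
        const float distance = squaredDistance(d, vocabulary[w]);
        if (distance < bestDistance) {
            bestDistance = distance;
            best = w;
        }
    }
    return best;
}

// Fixed-length histogram: the share of this image's descriptors assigned
// to each word. Its length depends only on the vocabulary size, never on
// how many keypoints the image happened to produce.
std::vector<float> bagOfWords(const std::vector<Descriptor>& descriptors,
                              const std::vector<Descriptor>& vocabulary) {
    std::vector<float> histogram(vocabulary.size(), 0.0f);
    for (const Descriptor& d : descriptors) {
        histogram[nearestWord(d, vocabulary)] += 1.0f;
    }
    if (!descriptors.empty()) {
        for (float& bin : histogram) {
            bin /= static_cast<float>(descriptors.size());
        }
    }
    return histogram;
}
```

An image with 40 keypoints and an image with 400 both come out as a vector with one entry per word, which is exactly what the input layer of the network needs.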

Once we have the Neural Network trained, we can then try to figure out how accurate it is by using something called a Confusion Matrix, a table of true classes against predicted classes that shows how well our Neural Network tells apart the different classes that we’ve trained for. With the data I’ve fed my Neural Network, I’m getting a general accuracy of about 95%, which is pretty good for what we’re working on, which is control of non-critical systems and components.
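For anyone who hasn’t built one before, a confusion matrix is easy to compute by hand. The sketch below assumes you have the true label and the predicted label for each test image as class indices; `confusionMatrix` and `accuracy` are illustrative names, not the functions used in our repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Rows are the true classes, columns are the predicted classes.
// Entry [t][p] counts the samples of class t that were classified as p.
Matrix confusionMatrix(const std::vector<std::size_t>& truth,
                       const std::vector<std::size_t>& predicted,
                       std::size_t numClasses) {
    assert(truth.size() == predicted.size());
    Matrix matrix(numClasses, std::vector<int>(numClasses, 0));
    for (std::size_t i = 0; i < truth.size(); ++i) {
        assert(truth[i] < numClasses && predicted[i] < numClasses);
        matrix[truth[i]][predicted[i]] += 1;
    }
    return matrix;
}

// Overall accuracy: correct predictions (the diagonal) over all samples.
double accuracy(const Matrix& matrix) {
    long correct = 0;
    long total = 0;
    for (std::size_t r = 0; r < matrix.size(); ++r) {
        for (std::size_t c = 0; c < matrix[r].size(); ++c) {
            total += matrix[r][c];
            if (r == c) {
                correct += matrix[r][c];
            }
        }
    }
    return total == 0 ? 0.0 : static_cast<double>(correct) / total;
}
```

The off-diagonal entries are the interesting part: a large count at row “one finger, left hand” and column “two fingers, left hand” would say those two gestures are being mistaken for each other, which a single accuracy number hides.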

Once the Neural Network is trained, you can save it to an XML file and then load it into any other application. On a 2015 MacBook the Neural Network classifies the image in about 200 milliseconds. There’s plenty I could do to speed it up but for simplified testing, it seems to be sufficient. That’s what I’ve got for now but I’m going to look at using a different technique for classification called a Support Vector Machine or SVM and I’ll definitely drop a post up here on that once it’s ready.