Do You Look Like Your Dog?

Steve Ellingson
13 min read · Feb 15, 2021

Everyone knows that some people are uncannily like their dogs, whether in temperament, demeanor, or physical characteristics. But have you ever wondered whether you look like your dog? Thanks to the fascinating world of machine learning and AI, now you can build a tool to find out.

Project Overview

Background Information: Project Domain & Origin

This project comes from the Udacity Data Scientist Nanodegree program and uses convolutional neural networks (CNNs) to classify images of dogs according to their breed. When an image file path is provided to the algorithm, the image goes through several steps. First, it is checked for the presence of a human face. If one is detected, the algorithm predicts the dog breed that the face most resembles. If no face is found, the image is then checked for the presence of a dog and, if one is detected, the algorithm predicts its breed.

The problem lies in how to go about accomplishing these three tasks:

  1. Human face detection
  2. Dog detection
  3. Breed prediction

Metrics & the Data

We will judge how well our CNN performs by its accuracy: the proportion of images for which our classifier predicts the correct label. We are looking to detect human faces and dogs in an image and then predict a breed for both. Since there is no correct breed for a human face, we are only truly concerned with the ability of our CNN to correctly predict a breed for the dogs in the dog dataset.
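As a minimal sketch of this metric, assuming predicted and true breeds are stored as plain integer class indices (the function name and sample values are illustrative, not taken from the project):

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Three of the four breed indices match, so the accuracy is 0.75.
print(accuracy([3, 57, 12, 12], [3, 57, 12, 40]))
```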

Analysis

Data Exploration & Visualization

Before getting into the details of how to implement our methodology, let’s take a quick look at our dataset. Each datapoint is a file path to a BGR image on the Web. We have:

13,233 images of humans

8,351 images of dogs

Our set of dog images will be split into 3 different subsets which we will use for training, validation, and testing of our model.

6,680 training images (80%)

835 validation images (10%)

836 test images (10%)

When we scan our human dataset for faces, we find that 98.7% of the images contain detectable human faces. The same scan also reports human faces in 10.9% of the images in the dog dataset.

Images of dogs that were classified as having detectable human faces

The images above are clearly of dogs rather than people but, for whatever reason, the algorithm detected human faces there. The face detector we are using is obviously not perfect. But, while some of this 10.9% are in fact dogs which are incorrectly classified as having detectable human faces, some of the images do contain humans alongside dogs.

Images of humans that exist in the dog dataset

Furthermore, dogs show up in some of the human files as well, albeit less than 1% of the time.

Images in the human dataset that also contain dogs

As previously mentioned, we will be predicting dog breeds for both datasets, so we are not concerned about these anomalies. Our real concern lies in accurately predicting the correct dog breed for images of dogs.

Methodology

First off, we are going to use OpenCV’s Haar feature-based cascade classifiers to check for human faces in a set of images. Then, we will use a ResNet-50 model pretrained on the ImageNet dataset to detect dogs. And finally, we will write our own CNN to predict the dog breed that most resembles either the human or the dog in the image.

While formulating a strategy for each task, we will verify how well we are doing by calculating the accuracy of our methods, because a face detector that fails at finding human faces will do us no better than a dog detector that cannot distinguish a chihuahua from a chainsaw.

Next, we will create a custom CNN and see if we can do better than chance at predicting a correct dog breed. And finally, we will implement transfer learning and bottleneck features using a pretrained CNN to significantly improve the accuracy of predictions.

Human Face Detection

The challenge of detecting human faces in images is one that has been tackled by many intelligent people before. And since they have done a better job than we could ever dream of doing at this point, we will piggyback on their expertise instead of reinventing the wheel for this task. Researchers have produced AdaBoost-based machine learning classifiers that scan images for human features, and they have made them available to us through OpenCV, a Python library for computer vision. We will use one by simply loading the pretrained face detector (which has conveniently been stored as an XML file).

Images of faces detected by an algorithm

Dog Detection

For detecting dogs, we are going to use a CNN called ResNet-50. CNNs mimic the way our brains learn to associate images with things over repeated encounters, and ResNet-50 has been trained in a comparable way. We will load it with weights that have been trained on ImageNet, an exceptionally large and extremely popular dataset comprising more than 10 million images, each assigned to 1 of 1,000 distinct categories from goldfish to toilet paper. Within these categories are 118 dog categories (indices 151 through 268) that it can recognize. Our dog detector will use these to determine whether the image we send it can be labeled a dog.

Breed Prediction

Now that we know how to detect human faces and dogs in images, we will need to create a CNN which will predict a breed for both. Making this kind of prediction can be exceptionally difficult since even we humans have a challenging time distinguishing between breeds. Many dog breeds look a lot alike, and some come in several different colors.

Our dog dataset covers 133 breeds. If you were to randomly guess a breed for a dog in any one of our images, you would have a 1 in 133 chance of being right. That is less than 1% (about 0.75%), an incredibly low bar for our breed predictor to beat. So our first attempt at creating a CNN will aim to do better than 1% accuracy. After that, we will train a CNN using transfer learning and use bottleneck features from a pretrained model to reduce training time without sacrificing accuracy.

Implementation

Preprocessing

We begin our implementation by loading our datasets and separating the dog portion into train, valid, and test sets. Additionally, we load a list of 133 dog names which we will eventually use when predicting breeds. Since we are working with relatively clean datasets and are predicting breeds for all images, this is all the preprocessing that needs to be done outside of our detectors and predictor. Additional preprocessing within these functions will be indicated.

Face Detector

We continue by creating a function for human face detection. This function takes a file path for an image as a parameter. We load the image using OpenCV (imported as cv2), convert it to grayscale (as is standard procedure), and then pass it to OpenCV’s Haar feature-based cascade classifier, which does the actual face detection for us and returns the faces it found. If the number of faces is greater than 0, the function returns True.
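A minimal sketch of such a face detector might look like the following. It assumes the opencv-python package is installed and uses the frontal-face cascade file that ships with it; the project may load a different XML file, and the helper name has_face is only illustrative.

```python
def has_face(faces):
    """Return True when the detector reported at least one bounding box."""
    return len(faces) > 0


def face_detector(img_path, cascade_path=None):
    """Return True if at least one human face is detected in the image."""
    # Imported here so the sketch can be loaded without OpenCV installed.
    import cv2

    if cascade_path is None:
        # Assumption: use a frontal-face cascade bundled with opencv-python.
        cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    cascade = cv2.CascadeClassifier(cascade_path)

    img = cv2.imread(img_path)  # OpenCV loads images in BGR channel order
    if img is None:
        raise FileNotFoundError(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray)  # one (x, y, w, h) box per face
    return has_face(faces)
```

Keeping the "at least one box" check in its own small function makes the decision rule easy to verify without running the detector itself.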

Dog Detector

Next, we create our dog detector. This function also takes a file path for an image as a parameter. Its job is simply to call another function that predicts an ImageNet label for the image. If that function returns a value between 151 and 268 (inclusive), then the dog detector returns True to indicate that a dog has been detected.
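The range check can be sketched as below. The label predictor is passed in as a function so the sketch stays self-contained; the constant and parameter names are illustrative, and the range is assumed to be inclusive at both ends.

```python
DOG_LABEL_MIN = 151  # first ImageNet index that denotes a dog category
DOG_LABEL_MAX = 268  # last ImageNet index that denotes a dog category


def is_dog_label(label):
    """Return True if an ImageNet class index falls in the dog range."""
    return DOG_LABEL_MIN <= label <= DOG_LABEL_MAX


def dog_detector(img_path, predict_label):
    """Return True if the predicted ImageNet label for the image is a dog.

    `predict_label` is any function that maps an image path to an
    ImageNet class index (for example, a ResNet-50 based predictor).
    """
    return is_dog_label(predict_label(img_path))


# A stand-in predictor that always answers 200 is treated as "dog".
print(dog_detector("some_image.jpg", lambda path: 200))  # True
```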

Label Predictor

For the label predictor, we are using the pretrained ResNet-50 model discussed above. This function also takes a file path for an image as a parameter. But because the model requires some preprocessing before it can make a prediction, the image is first passed through two other functions. One converts the file path to a 4D tensor, while the other, preprocess_input, normalizes the image by subtracting the mean pixel value from every pixel. The ResNet-50 model is then used to make a prediction that returns a probability vector. By taking the argmax of this vector, we get an integer corresponding to the label predicted by the model. We could use this integer to look up the ImageNet category for our image (ImageNet being the dataset on which the model’s weights were trained). But, for now, our label predictor function simply returns the integer so that the dog detector function above can check whether it falls between 151 and 268, the values indicating dog categories.
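One way to sketch this label predictor, assuming TensorFlow's bundled Keras is installed, is shown below. The function names are illustrative, and path_to_tensor is expected to be supplied by the caller as a function that turns a file path into a (1, 224, 224, 3) array.

```python
def top_label(probabilities):
    """Return the index of the largest probability (the argmax)."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


def resnet50_predict_label(img_path, path_to_tensor, model=None):
    """Return the ImageNet class index that ResNet-50 predicts for an image."""
    # Imported here so the sketch can be loaded without TensorFlow installed.
    from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

    if model is None:
        model = ResNet50(weights="imagenet")  # downloads weights on first use
    batch = preprocess_input(path_to_tensor(img_path))
    probabilities = model.predict(batch)[0]  # one probability per category
    return top_label(probabilities.tolist())


print(top_label([0.1, 0.7, 0.2]))  # 1
```

Passing an already loaded model through the model argument avoids reloading the weights for every image.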

Convert Image Path to Tensor

Here, we will create a function that converts an image file path to a 4D tensor so that the ResNet-50 CNN model can predict a label for the image. Again, the function takes an image file path as a parameter. It then loads it as a square image that is 224 x 224 pixels. Next, it converts the image to an array and then expands it to a 4D tensor of shape (1, 224, 224, 3). This 4D tensor is then returned so that it can be passed on to the CNN for label prediction.
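A sketch of this conversion, assuming NumPy and a recent TensorFlow release (where load_img and img_to_array live under tensorflow.keras.utils), could look like this. The constant names are illustrative.

```python
TARGET_SIZE = (224, 224)               # height and width the model expects
EXPECTED_SHAPE = (1, *TARGET_SIZE, 3)  # (samples, rows, columns, channels)


def path_to_tensor(img_path):
    """Load an image file and return it as a (1, 224, 224, 3) array."""
    # Imported here so the sketch can be loaded without these packages.
    import numpy as np
    from tensorflow.keras.utils import img_to_array, load_img

    img = load_img(img_path, target_size=TARGET_SIZE)  # resized RGB image
    array = img_to_array(img)                          # shape (224, 224, 3)
    tensor = np.expand_dims(array, axis=0)             # add the batch axis
    assert tensor.shape == EXPECTED_SHAPE
    return tensor
```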

These functions complete the tasks of human face detection and dog detection. Our next task is breed prediction. To do this, we are first going to create our own CNN to see if we can beat a 1% accuracy rate, or, in other words, do better than making a random guess.

Custom CNN

First, we need to do some preprocessing to rescale our images by dividing every pixel in each image by 255.

Next, we will create our CNN using a Sequential model so that we can incrementally stack our own layers. We start with a convolution layer of 16 feature maps, each using a 3 x 3 pixel filter, with an input shape of (224, 224, 3). This is the height and width of the input image tensors plus the number of channels: one each for the red, green, and blue channels of our images. After this, we add a max pooling layer, which reduces the spatial dimensions of its input, makes the detected features less sensitive to their exact position, and helps reduce overfitting. We use four of these convolution and pooling blocks in total, doubling the number of feature maps each time (16, 32, 64, 128). Finally, we add a global average pooling layer and a dense layer to finish off the model. The global average pooling layer reduces the total number of parameters in the model, and the dense layer combines the features extracted by the previous layers into a score for each breed.
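The architecture described above can be sketched as follows, assuming TensorFlow's bundled Keras. The pool size of 2 and the default padding are assumptions, since they are not stated here; the small parameter-count helper is only a sanity check and does not depend on those choices.

```python
NUM_BREEDS = 133
FILTERS = (16, 32, 64, 128)


def conv_params(kernel, channels_in, channels_out):
    """Weights plus biases for one kernel x kernel convolution layer."""
    return kernel * kernel * channels_in * channels_out + channels_out


def expected_param_count(filters=FILTERS, kernel=3, num_classes=NUM_BREEDS):
    """Count trainable parameters by hand (pooling layers have none)."""
    total, channels_in = 0, 3  # the input has 3 colour channels
    for channels_out in filters:
        total += conv_params(kernel, channels_in, channels_out)
        channels_in = channels_out
    return total + channels_in * num_classes + num_classes  # dense layer


def build_custom_cnn(filters=FILTERS, num_classes=NUM_BREEDS):
    """Stack convolution and pooling blocks, then pool globally and classify."""
    # Imported here so the sketch can be loaded without TensorFlow installed.
    from tensorflow.keras import Input
    from tensorflow.keras.layers import (Conv2D, Dense,
                                         GlobalAveragePooling2D, MaxPooling2D)
    from tensorflow.keras.models import Sequential

    model = Sequential()
    model.add(Input(shape=(224, 224, 3)))
    for n in filters:
        model.add(Conv2D(n, kernel_size=3, activation="relu"))
        model.add(MaxPooling2D(pool_size=2))  # assumption: 2 x 2 pooling
    model.add(GlobalAveragePooling2D())
    model.add(Dense(num_classes, activation="softmax"))
    return model


print(expected_param_count())  # 114597
```

Calling build_custom_cnn().summary() should report the same total as the hand count.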

Having successfully created a model, we now compile the model, create a checkpointer to save the best weights during fitting, and fit the model with our data using the train and valid tensors and targets that we created when loading our datasets.

Once the lengthy process of fitting the data to our model is complete, we load the best weights that were saved by the checkpointer into our model and run an accuracy test against our test targets.
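The compile, checkpoint, fit, and test steps could be sketched as below, again assuming TensorFlow's bundled Keras. The batch size and the weights file name are assumptions made for illustration; the loss, optimizer, metric, and epoch count follow the settings reported in the Results section.

```python
TRAIN_CONFIG = {
    "loss": "categorical_crossentropy",
    "optimizer": "rmsprop",
    "metrics": ["accuracy"],
    "epochs": 20,
    "batch_size": 20,  # assumption: the article does not state a batch size
}


def fit_and_evaluate(model, train, valid, test,
                     weights_path="best_custom.weights.h5",
                     config=TRAIN_CONFIG):
    """Train `model`, restore its best weights, and return test accuracy.

    `train`, `valid`, and `test` are (tensors, one_hot_targets) pairs.
    """
    # Imported here so the sketch can be loaded without TensorFlow installed.
    from tensorflow.keras.callbacks import ModelCheckpoint

    model.compile(loss=config["loss"], optimizer=config["optimizer"],
                  metrics=config["metrics"])
    # Keep only the weights from the epoch with the lowest validation loss.
    checkpointer = ModelCheckpoint(filepath=weights_path, monitor="val_loss",
                                   save_best_only=True, save_weights_only=True)
    model.fit(train[0], train[1], validation_data=valid,
              epochs=config["epochs"], batch_size=config["batch_size"],
              callbacks=[checkpointer])
    model.load_weights(weights_path)
    _, test_accuracy = model.evaluate(test[0], test[1], verbose=0)
    return test_accuracy
```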

This accuracy test tells us that our model is accurate just over 12% of the time. That is not great, but it beats our 1% target by a factor of about 12. Clearly, we can do much better, and faster, by applying transfer learning to a pretrained CNN.

Refinement: Transfer Learning & Bottleneck Features

Transfer learning reuses a CNN that has already been trained on another task: we take the output of its last convolutional block and use it as the input to a small new model. This way, we only need to add global average pooling and dense layers. The precomputed outputs we train on are called bottleneck features: the last activation maps before the fully connected layers of the pretrained model. All of this allows us to achieve relatively high accuracy with few training images.

We are using the bottleneck features from a ResNet-50 CNN to produce our better model. Again, we create an instance of a Sequential model and then, since we are using transfer learning, add only global average pooling and dense layers to it. The input shape of the pooling layer is set to the shape of a single bottleneck feature from the train set, and the dense layer has one node for each of our predictable dog names.
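A sketch of this smaller model follows, assuming TensorFlow's bundled Keras. The default feature shape of (7, 7, 2048) is an assumption that matches what ResNet-50 without its top layers produces for 224 x 224 inputs; precomputed bottleneck files may use a different spatial size, which does not change the parameter count.

```python
def transfer_param_count(feature_channels=2048, num_classes=133):
    """Parameters of the dense layer; the pooling layer has none."""
    return feature_channels * num_classes + num_classes


def build_transfer_model(feature_shape=(7, 7, 2048), num_classes=133):
    """Global average pooling plus a softmax layer on bottleneck features."""
    # Imported here so the sketch can be loaded without TensorFlow installed.
    from tensorflow.keras import Input
    from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
    from tensorflow.keras.models import Sequential

    return Sequential([
        Input(shape=feature_shape),  # shape of one bottleneck feature
        GlobalAveragePooling2D(),    # collapses each feature map to one value
        Dense(num_classes, activation="softmax"),
    ])


print(transfer_param_count())  # 272517
```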

We again compile this model, create a checkpointer with a file path for saving the best model weights, and fit the model using the train and valid bottleneck features and targets.

After fitting, we load the best weights and test the accuracy of our model.

We now have a model that reaches about 81% accuracy, a huge improvement over the 12% or so accuracy of our custom CNN. Using transfer learning and bottleneck features produces a much more accurate model than the one we started with. And with that, we can begin predicting dog breeds for the images in our original datasets.

To predict dog breeds, we write a function that takes in an image file path as a parameter. First, we use the pretrained ResNet-50 model to extract the bottleneck feature for the image at that path. Then, we use our model (which we produced above with transfer learning) to predict a probability vector from the extracted bottleneck feature. We take the argmax of this vector to get the integer which corresponds to a dog breed, and we match that integer to the breed in our previously loaded list of dog names. To finish up, we simply format the predicted dog breed for printing and return it.
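These steps might be sketched as follows, assuming TensorFlow's bundled Keras. The helper arguments (path_to_tensor, breed_model, dog_names) are supplied by the caller, and the formatting step assumes breed names are stored with underscores, such as "Afghan_hound"; both are illustrative assumptions.

```python
def format_breed(raw_name):
    """Make a stored breed name readable (assumes underscores for spaces)."""
    return raw_name.replace("_", " ")


def predict_breed(img_path, path_to_tensor, breed_model, dog_names,
                  feature_extractor=None):
    """Return the formatted name of the breed predicted for an image."""
    # Imported here so the sketch can be loaded without TensorFlow installed.
    from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

    if feature_extractor is None:
        # ResNet-50 without its classification layers yields bottleneck features.
        feature_extractor = ResNet50(weights="imagenet", include_top=False)
    batch = preprocess_input(path_to_tensor(img_path))
    features = feature_extractor.predict(batch)
    probabilities = breed_model.predict(features)[0]  # one value per breed
    return format_breed(dog_names[int(probabilities.argmax())])


print(format_breed("Afghan_hound"))  # Afghan hound
```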

And with that, we finally have all the pieces we need to write a single function that predicts a dog breed for any image in our dataset.
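Putting the pieces together could look like the sketch below. The detectors and the breed predictor are passed in as functions that each take an image path, so the control flow can be read on its own; the function name and the messages are illustrative.

```python
def describe_image(img_path, face_detector, dog_detector, predict_breed):
    """Check for a human face first, then for a dog, and report a breed."""
    if face_detector(img_path):
        return f"Hello, human! You look like a {predict_breed(img_path)}."
    if dog_detector(img_path):
        return f"Hello, dog! Your predicted breed is {predict_breed(img_path)}."
    return "No human face or dog was detected in this image."


# Stand-in detectors show the three possible outcomes.
print(describe_image("me.jpg", lambda p: True, lambda p: False, lambda p: "Beagle"))
print(describe_image("rex.jpg", lambda p: False, lambda p: True, lambda p: "Beagle"))
print(describe_image("cup.jpg", lambda p: False, lambda p: False, lambda p: "Beagle"))
```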

Results

Model Evaluation & Validation

Initially, our custom CNN contained four convolution layers, beginning with 16 feature maps and doubling with each layer until ending with 128. Each used a 3 x 3 pixel filter and ReLU activation, and the first took an input shape of (224, 224, 3), corresponding to the shape of the processed input tensors. A global average pooling layer was then added before ending with a softmax dense layer with one node for each predictable dog name (133). This produced a model with 114,597 total parameters. We compiled this model with loss set to categorical_crossentropy, optimizer set to rmsprop, and metrics set to accuracy. We trained this model over 20 epochs, using a checkpointer to save the best performing weights to a file. We then loaded these weights into our model and tested it, producing an accuracy of over 12%.

The challenge presented by this initial custom model is that it takes a long time to train (due to the size of our dataset) and produces a less-than-desirable accuracy rate.

To improve upon this, we shifted to a pretrained ResNet-50 model as a fixed feature extractor. To this we only had to add a global average pooling layer and a softmax dense layer containing one node for each predictable dog name. This produced a model with more than double the number of total parameters: 272,517.

We compiled this model with loss set to categorical_crossentropy, optimizer set to rmsprop, and metrics set to accuracy. We then trained it on bottleneck features and targets over 20 epochs and used a checkpointer to save the best performing weights to a file. We then loaded these weights into our refined model and tested it. This produced an accuracy score of over 81%.

Justification

Transfer learning allowed us to take the previous training of a ResNet-50 CNN and apply it to our refined CNN. Then, using a smaller set of bottleneck features as the input for the last layers of our model, we were able to significantly speed up and improve the training and accuracy of our refined model. This method enabled us to predict the correct breeds for dogs over 81% of the time, a huge improvement over our initial model.

Conclusion

Reflection & Improvement

The process of creating a way to detect both humans and dogs and then make a prediction of dog breed is not exactly a short and simple one. We utilized a feature-based cascade classifier from OpenCV to detect human faces. This comes with the drawback that a user must pass in an image that provides a clear view of a human face for the face to be accurately detected. This could be circumvented by instead implementing a multi-task cascaded CNN (MTCNN) which can detect the presence of additional facial features. But for our purposes here, the feature-based classifier is sufficient.

We also created a custom CNN to see if we could predict better than chance, and then improved upon it by utilizing transfer learning and bottleneck features from a pretrained model to reach ~81% accuracy — a huge improvement over the initial custom model’s ~12% accuracy. It is possible to achieve even better results by putting in the time to tune the model parameters even further, to add additional layers, and to perhaps train over a different number of epochs.

In the end, we produced a workable solution for using convolutional neural networks to detect whether an image contains a human or a dog and to predict a dog breed for it with reasonably acceptable results. So, if you are looking to test whether you are one of those people who resemble their dog, now you can set up an algorithm to check.
