Intelligent annotation of images is one of the major capabilities a full-fledged humanoid robot would need in order to pass for a human being. The topic is so vast that a solution cannot be found for it in one go, so I will attempt to solve narrower aspects of this large problem in the projects I take up, and this website will document those projects. I am particularly interested in neuroscience-informed approaches to the problem, so artificial neural networks will play a major role in these projects. Currently, apart from information on the first and only project so far, "Detecting and Localizing Arabic numerals in Images", there is nothing else on this website.

Detecting and Localizing Arabic numerals in Images
Introduction and Preliminary work

This project started out as an attempt to detect and localize telephone numbers in input images. However, due to the short amount of time I had to complete my undergraduate thesis, I had to make do with creating just a digit detection system.

Main aim of the project

In this project, I aim to construct an artificial neural network (ANN) capable of detecting and localizing telephone numbers in an input image.

Figure 1.1: Inputs to the proposed ANN will be images like the one on the left. The ANN will process the image, detect the presence of telephone numbers, and, if any occur, return enough information about their position and size to permit drawing a border around them, as in the image on the right.
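
To make the intended output concrete, here is a minimal sketch of how such detections could be rendered, assuming (purely for illustration, not as a fixed design) that the ANN reports each telephone number as an (x, y, width, height) tuple:

    # Hypothetical illustration: draw a border around each detection,
    # where a detection is an (x, y, width, height) tuple.
    from PIL import Image, ImageDraw

    def draw_detections(image_path, detections, out_path="annotated.png"):
        img = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(img)
        for (x, y, w, h) in detections:
            draw.rectangle([x, y, x + w, y + h], outline=(255, 0, 0), width=2)
        img.save(out_path)

    # e.g. draw_detections("input.png", [(40, 60, 180, 32)])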

Proposed plan of action

  1. Obtain knowledge required to create ANNs
  2. Construct a Convolutional Neural Network (CNN) capable of recognizing digits from the MNIST database
  3. Create a Space Displacement Neural Network (SDNN) based on the CNN that has a larger input frame, and that can localize MNIST digits in their “natural” background
  4. Give the SDNN the capability to selectively detect and localize sequences of MNIST digits that qualify as telephone numbers
  5. Repeat steps 2-4, increasing the complexity of the dataset with each repetition until it mimics real-life situations.

Additional objectives

I only recently began working with ANNs, so through this project I also wish to strengthen my fundamentals in them. Additionally, if something piques my interest along the way, say a possible way of improving the generalization ability of CNNs, I will spend some time focusing on that before I return to working towards the aim of this project.

Preliminary Work

To ensure that I spend more time on the "science" of the task and less time coding, I needed a framework I could use to create the ANNs I want. I could have used an existing framework like PyBrain or FANN, but I wanted the ability to change the underlying code with ease, which would allow me to incorporate any custom variations to the learning rule. So, after completing step one of my plan, I created a Python module capable of generating and training ANNs of various architectures. To test the module, I constructed a CNN with it and trained the CNN to separate 8s from all other digits. I used numerals from the MNIST database to construct the training and testing sets. Due to time constraints, I could only use 36 numerals in each dataset (18 8s and 2 of every other digit). Even with datasets that small, the CNN took about 10.5 hours to train.
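
For concreteness, here is a minimal sketch (not my actual module) of how such a subset can be assembled; 'images' and 'labels' stand for the MNIST arrays as loaded by any standard MNIST reader:

    # Assemble the tiny 8-vs-rest set described above: 18 eights plus
    # 2 of every other digit, i.e. 36 images in total.
    import numpy as np

    def eight_vs_rest_subset(images, labels, seed=0):
        rng = np.random.default_rng(seed)
        keep = []
        for digit in range(10):
            idx = np.flatnonzero(labels == digit)
            n = 18 if digit == 8 else 2
            keep.extend(rng.choice(idx, size=n, replace=False))
        keep = np.array(keep)
        X = images[keep].reshape(len(keep), -1) / 255.0  # flatten, scale to [0, 1]
        y = (labels[keep] == 8).astype(int)              # 1 for an '8', 0 otherwise
        return X, y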

Figure 1.2: This shows the training curve of the CNN, and the error rates on both the training and testing sets. The data from the above plot can be accessed in the form of a two-dimensional NumPy array by following this link (the link leads to a '.npy' file).

The Python module is available here for download. However, more work has to be done on the module before I use it to progress to step two and beyond. As can be seen from Figure 1.2, the generalization ability of the CNN turned out to be very poor. This might be due to the small size of the training set or due to shortcomings of the learning rule that was employed. The size of the dataset could not be increased because training was already taking too long, so I am currently parallelizing the module to make it run faster. Once the module starts training ANNs fast enough, I intend to modify the learning rule to improve generalization. Currently, the changes that should be made to the weights are calculated by the backpropagation algorithm (just plain gradient descent). I will most probably add weight decay, and maybe something else, to the learning rule.

(Note: The above-mentioned Python module has a small bug: the biases of the artificial neurons don't get updated during learning. This bug has been fixed in the C++ version of the code.)
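
For reference, here is a sketch of the update step in question: plain gradient descent with the weight decay term I am considering, plus the bias update whose omission was exactly the bug noted above. The function name and signature are mine, not the module's:

    # One gradient-descent step on a layer's parameters.
    import numpy as np

    def sgd_step(W, b, dW, db, lr=0.01, weight_decay=0.0):
        W -= lr * (dW + weight_decay * W)  # weight decay pulls weights toward zero
        b -= lr * db                       # the bias update the buggy module skipped
        return W, b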

References

  1. Hecht-Nielsen, R. "Theory of the backpropagation neural network." International Joint Conference on Neural Networks (IJCNN), vol. 1, pp. 593-605. IEEE, 1989. doi: 10.1109/IJCNN.1989.118638
  2. MacKay, David J. C. "The Single Neuron as a Classifier." Information Theory, Inference and Learning Algorithms, pp. 471-483. Cambridge University Press, 2003.
  3. LeCun, Y., K. Kavukcuoglu, and C. Farabet. "Convolutional networks and applications in vision." Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 253-256. IEEE, 2010.
  4. www.coursera.org/course/neuralnets

A minor change in my plan of action

I’ve decided not to use the MNIST database as a starting point in my project. The characters in the MNIST database are very intricate and difficult to learn, so if I started with it, I might end up wasting a lot of time training ANNs to learn the intricacies of the characters and lose focus on bigger problems like digit localization. The reason I wanted to use the MNIST database in the first place was that, it being a well-known, standard database, I thought testing and comparing the performance of the ANNs I create would be easier with it. Nevertheless, I decided against it. Instead, I used a dataset of printed characters that I generated myself with the Python Imaging Library (PIL). Here’s the version of the dataset I used to test the framework I created. I was able to achieve an accuracy of 91.25% on this dataset (the test set comprised 80 images randomly selected from the 480 images in the dataset).
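
Here is a minimal sketch of how such a printed-digit dataset can be generated with PIL. The font path, image size, and amount of positional jitter are illustrative assumptions, not the exact settings I used:

    import random
    from PIL import Image, ImageDraw, ImageFont

    FONT_PATH = "/Library/Fonts/Arial.ttf"  # any TrueType font on the system
    SIZE = 32                               # side length of each sample in pixels

    def render_digit(digit):
        img = Image.new("L", (SIZE, SIZE), color=0)  # black background
        draw = ImageDraw.Draw(img)
        font = ImageFont.truetype(FONT_PATH, random.randint(18, 26))
        # Jitter the position slightly so the samples are not all identical.
        x, y = random.randint(2, 8), random.randint(0, 6)
        draw.text((x, y), str(digit), fill=255, font=font)
        return img

    # 480 images: 48 variants of each of the ten digits.
    for d in range(10):
        for k in range(48):
            render_digit(d).save("digit_%d_%02d.png" % (d, k))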

Change of Programming language

I tried for a really long time to parallelize the Python framework using Python's built-in multiprocessing module, but I couldn't. So, with a heavy heart, I abandoned the code I had written and moved to C++. Parallelizing the C++ version of my code was much easier (I used OpenMP). To make sure the code was successfully parallelized, I created an MLP with a single, very wide hidden layer and trained it with both the parallel and serial versions of the code. On my MacBook Pro with 4 cores, the parallelized version turned out to be approximately twice as fast. Click here to download the source code for both the serial and parallel versions.

Completed C++ framework

I’ve completed work on the framework that I intend to use for this project. I might make minor changes to it as time goes by, but for now, I don’t think I’ll be changing it much. The framework now uses RPROP, which converged much faster than simple gradient descent with backpropagation: in one particular case it converged 20 times faster, and in another case, where backpropagation didn't converge at all, it converged within an hour. The framework is available for download here. The linked directory also contains the dataset mentioned in the first paragraph, training statistics of the ANN with 91.25% accuracy, and general information about the same ANN. The training curves of that ANN are shown below.
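
First, for reference, here is a minimal sketch of the per-weight RPROP update (the iRPROP- variant; Riedmiller and Braun describe the original scheme in reference 1 below). Each weight keeps its own step size, which grows while the gradient keeps its sign and shrinks when the sign flips:

    import numpy as np

    def rprop_step(W, grad, prev_grad, step,
                   eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
        sign_change = grad * prev_grad
        grow = sign_change > 0
        flip = sign_change < 0
        step[grow] = np.minimum(step[grow] * eta_plus, step_max)
        step[flip] = np.maximum(step[flip] * eta_minus, step_min)
        grad = grad.copy()
        grad[flip] = 0.0                 # skip the update right after a sign flip
        W -= np.sign(grad) * step        # only the sign of the gradient is used
        return W, grad, step             # returned grad becomes the next prev_grad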

Figure 2.1: Training curve of the ANN that achieved 91.25% accuracy on the test set.

Figure 2.2: Testing curve of the ANN that achieved 91.25% accuracy on the test set.

References

  1. Riedmiller, Martin, and Heinrich Braun. "A direct adaptive method for faster backpropagation learning: The RPROP algorithm." IEEE International Conference on Neural Networks. IEEE, 1993.
  2. LeCun, Yann. "Generalization and network design strategies." Connectionism in Perspective. North-Holland, Amsterdam (1989): 143-155.

My novel approach to recognize digits in input images

During my experiments with ANNs, I noticed that the ANNs I trained to function as digit classifiers responded to images not containing a digit with seemingly random output values. This gave me an idea which, if successfully implemented, would have obviated the need to use negative samples: instead of using a single ANN trained on a set of digits and negative samples, I could use multiple ANNs trained as digit classifiers and multiply their output probabilities. Since non-digits appeared to elicit random probabilities ranging from zero to one, while digits consistently elicited probabilities close to one, the product should stay close to one for a digit and fall towards zero for a non-digit. The results I obtained while testing this hypothesis were initially promising. However, after spending a lot of time trying to reach satisfactory levels of accuracy with this technique, I decided to give up on the approach.
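
A minimal sketch of the scheme, with hypothetical names (any classifier exposing a scikit-learn-style predict_proba would do):

    import numpy as np

    def combined_score(window, classifiers):
        # Multiply each ANN's highest output probability for this window.
        probs = [clf.predict_proba(window[None]).max() for clf in classifiers]
        return np.prod(probs)

    # Decision rule, using the 0.6 threshold mentioned in Figure 3.2:
    # is_digit = combined_score(window, classifiers) > 0.6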

Figure 3.1: An example of perfect digit recognition using the technique I proposed. Even in this example, it can be seen that the output of the ANNs when shown a blank image is actually not so unpredictable.

Figure 3.2: Examples of mistakes made while using this approach (please refer to the legend from Figure 3.1). Even though I achieved a number-guessing accuracy of 62.22%, apart from a few data points the results I obtained were plagued with false positives and false negatives like the ones shown here. I used a threshold of 0.6 while guessing the number.

I tried a number of ways to improve the system I constructed using this approach. ANNs with a larger number of free parameters tend to generalize poorly, so I expected larger ANNs to produce more random outputs when shown images of non-digits. However, this wasn't the case: the false positives and negatives persisted despite the larger ANNs. I also had a feeling that if the ANNs had a good enough conception of each digit, the number of false positives would fall and the system would give good results. I tried using more complex datasets and different architectures to accomplish this, but the error rates failed to get any better.

A general update

I have begun using the Virgo Super Cluster at IIT Madras for this project. It feels great, because access to this supercomputer is usually only given to post-graduate scholars for work related to their theses. I benchmarked the supercomputer, my MacBook Pro, and the server in the Computational Neuroscience (CNS) lab by training a batch of six ANNs on all three machines, and the results were quite surprising. The Virgo Super Cluster expectedly took less than a tenth of the time taken by my MacBook Pro, but the server in the CNS lab took even less time, about 10% less than the supercomputer, in spite of the fact that I used double the number of cores on the Virgo Super Cluster. On reflection, the heavy usage of the supercomputer probably explains why it took slightly longer than the CNS lab server: when I trained the batch of six ANNs on the CNS lab's server, I had the entire server to myself. The inability of my code to scale efficiently would have further exacerbated the problem. To access the output data obtained while training the batch of ANNs on the three machines, click here.

References

  1. Simard, Patrice, David Steinkraus, and John C. Platt. "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis." ICDAR. Vol. 3. 2003.

Source of inspiration

As explained in the previous section, my novel idea for detecting numbers in input images didn’t work. So I will be using an approach inspired by Rowley, Baluja, and Kanade’s face detection system. It is radically different from the plan of action I laid down in the first section, so I have outlined it below.

Basic outline of approach

  1. Create a digit detection system based on the paper linked to in the previous paragraph.
  2. Recognize the digits detected by the system.

To test this approach, I’ll be using a subset of the extra digits and full images provided along with the SVHN dataset; the subset I’ll be using can be accessed by following this link. It is complemented by an initial set of negative samples that I created from the full images in that subset. After training an ANN on the linked data, I will replace the initial set of negative samples with new ones: sub-images from the full images that are wrongly classified by the ANN. Then, I will re-train the same ANN on the newly constructed dataset. I intend to repeat this process until I get a sufficiently low error rate on the test set, or until I am no longer able to pick new negative samples. A sketch of this bootstrapping loop is given below.
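
Here is a minimal sketch of the loop under stated assumptions: random arrays stand in for the digit crops and full images, a scikit-learn MLP stands in for my framework, and for simplicity every window of a full image is treated as digit-free (in practice, windows overlapping the labelled digit boxes must be excluded):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    WIN = 24                                   # window side length (assumed)
    pos = rng.random((500, WIN * WIN))         # stand-in digit crops
    neg = rng.random((500, WIN * WIN))         # initial negative samples
    full_images = rng.random((20, 120, 160))   # stand-in full images

    def windows(img, win=WIN, stride=8):
        # Yield flattened win x win sub-images of a full image.
        H, W = img.shape
        for r in range(0, H - win + 1, stride):
            for c in range(0, W - win + 1, stride):
                yield img[r:r + win, c:c + win].reshape(-1)

    for round_ in range(5):                    # bootstrapping rounds
        X = np.vstack([pos, neg])
        y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
        clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X, y)

        # New negatives: sub-images the ANN wrongly classifies as digits.
        hard = [w for img in full_images for w in windows(img)
                if clf.predict(w[None])[0] == 1]
        if not hard:                           # no new negatives can be picked
            break
        neg = np.array(hard)                   # replace negatives and re-train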

An upgrade to the C++ framework

I've added the capability to monitor the error made by the ANN on each class during a classification task. That is, the framework can now give error rates for each class separately, as opposed to just a single error rate for the entire dataset. The upgraded framework can be accessed here. The framework also now saves the ANN with the best error on the test set, and not the ANN with zero error on the training set.
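
A small sketch of the two additions (the names here are mine, not the framework's):

    import numpy as np

    def per_class_error(y_true, y_pred, n_classes):
        # Fraction of samples of each class that were misclassified.
        errs = []
        for c in range(n_classes):
            mask = y_true == c
            errs.append(np.mean(y_pred[mask] != c) if mask.any() else 0.0)
        return np.array(errs)

    # Inside the training loop, keep the network with the best test error:
    # test_err = per_class_error(y_test, net.predict(X_test), 10).mean()
    # if test_err < best_err:
    #     best_err = test_err
    #     save_network(net)   # hypothetical save routine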

Figure 4.1: Training statistics obtained while training an ANN on the custom dataset mentioned earlier. The curve on the left represents the training set, and the one on the right the testing set. The training set had 3331 samples and the testing set had 654 samples; in both sets, the number of negative samples was equal or nearly equal to the number of positive samples.

Complete report