Wednesday, March 16, 2016

Using commands with Speech Recognition - Sixth Week

In my sixth week we considered submitting the project to the Microsoft Open Source Challenge. I looked at the tools they offered and found that the only one helpful for this project was the Speech API from Project Oxford. We could use speech recognition for voice commands in the Field Book app, so that the user could take notes in the field quickly and easily.

Project Oxford

This project from Microsoft Research is an open-source set of APIs for vision, speech, and language based on artificial intelligence. There is code for speech recognition on Android, including a sample application that uses it. I downloaded it and ran it on my phone. The recognition works by sending audio to a server, which, after processing it, responds with a list of five strings, each of which is a possible transcription of what the user said. Here is an example of what the code looks like:


// Imports assume the Project Oxford client SDK package names.
import com.microsoft.projectoxford.speechrecognition.MicrophoneRecognitionClient;
import com.microsoft.projectoxford.speechrecognition.SpeechRecognitionMode;
import com.microsoft.projectoxford.speechrecognition.SpeechRecognitionServiceFactory;

MicrophoneRecognitionClient m_micClient;
// ShortPhrase mode: short utterances, recognized when the phrase ends.
SpeechRecognitionMode m_recoMode = SpeechRecognitionMode.ShortPhrase;
m_micClient = SpeechRecognitionServiceFactory.createMicrophoneClient(
        this,            // the Activity
        m_recoMode,
        language,        // e.g. "en-us"
        this,            // receives the ISpeechRecognitionServerEvents callbacks
        subscriptionKey);
m_micClient.startMicAndRecognition();

The Activity has to implement the ISpeechRecognitionServerEvents interface for the code above to work.

There is also a tool called LUIS (Language Understanding Intelligent Service) that lets you build language models. This seemed like a good fit, because I wanted a specific language for the application (the language of valid commands, like "set height to 11"). But after testing it for a while, I realized it probably wouldn't work: I expected that, given an input audio, LUIS would always return a valid command as output, and it turns out it doesn't. So I figured I would have to write my own code that, given a string, returns a valid command.

Command Design with a Finite State Machine

I looked at the problem of understanding recognized speech as feeding a list of words to a Finite State Machine (FSM). If the FSM reads all the words it needs, it ends in an acceptance state (the ones drawn with two circles around them), and reaching that kind of state means the input is a valid command in the application. This works better than simply comparing the string returned by the speech recognizer against the valid commands, because sometimes words are added by mistake. When working with speech APIs you have to remember that recognition may fail many times, so your program must handle errors and try to recover from them. A sketch of this idea appears after the diagram below.
Diagram of a proposed Finite State Machine
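As an illustration, here is a minimal sketch of such a parser in Python (the app itself is in Java; the verbs, field names, and three-state structure here are hypothetical and much smaller than the real command set would be). It scans the word list and accepts only when it has seen a command verb, a field name, and a numeric value, ignoring stray words that the recognizer may have inserted:

# Minimal FSM sketch for commands like "set height to 11".
# States: START -> GOT_VERB -> GOT_FIELD -> ACCEPT (on a number).
VERBS = {"set"}
FIELDS = {"height", "width"}  # hypothetical trait names

def parse_command(words):
    state = "START"
    verb = field = value = None
    for w in words:
        if state == "START" and w in VERBS:
            state, verb = "GOT_VERB", w
        elif state == "GOT_VERB" and w in FIELDS:
            state, field = "GOT_FIELD", w
        elif state == "GOT_FIELD" and w.isdigit():
            state, value = "ACCEPT", int(w)
            break
        # any other word is ignored as recognition noise
    return (verb, field, value) if state == "ACCEPT" else None

print(parse_command("please set the height to 11".split()))
# -> ('set', 'height', 11)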


Monday, March 7, 2016

Adding more classes - Fifth Week

In my fifth week I thought it was time to give the classifier something harder, so I added two more classes to the dataset: tomato and wheat. I thought that one application of plant recognition could be robots that help farmers in the future. They would have semantic knowledge of plants, so they could act differently depending on what they see; for example, a robot could give wheat a different kind of fertilizer than a tomato plant.

Training image of the wheat class



Then I trained with the new dataset and obtained bad results, around 60% accuracy. To understand what was going on I implemented a confusion matrix, so that I could see specifically which classes were producing the biggest errors. I realized that classes like potato, which had many images and descriptors, did very well in testing, while classes like cassava, with fewer images and fewer descriptors, did poorly. So the problem was probably that the centers generated during K-Means favored the classes that had more descriptors. (A sketch of the confusion matrix computation is shown below.)
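For reference, building a confusion matrix takes only a few lines; this is a minimal sketch using scikit-learn, with made-up labels for illustration (the project's own implementation may differ):

from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six test images.
classes = ['cassava', 'pinto bean', 'tomato', 'wheat', 'potato']
y_true = ['cassava', 'cassava', 'wheat', 'tomato', 'potato', 'wheat']
y_pred = ['cassava', 'wheat', 'wheat', 'tomato', 'cassava', 'wheat']

# Row i, column j counts test images of class i predicted as class j,
# so the off-diagonal entries show exactly which classes get confused.
print(confusion_matrix(y_true, y_pred, labels=classes))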

To fix this problem I decided to use the same number of training and testing images for each class, so I selected 55 images for training and 30 for testing for every class in the dataset. I also changed the required structure of the dataset: the training image files are now separated from the testing ones in different folders inside each class folder, and it is no longer necessary to store a dataset object, since the set an image belongs to is fixed by where it is stored (see the sketch below).
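Assuming a layout like dataset/<class>/train and dataset/<class>/test (the exact folder names in the project may differ), loading the fixed split reduces to walking the directories:

import os

def load_split(root):
    """Return {class: {'train': [...], 'test': [...]}} of image paths."""
    dataset = {}
    for cls in sorted(os.listdir(root)):
        dataset[cls] = {}
        for subset in ('train', 'test'):  # 55 train / 30 test images per class
            folder = os.path.join(root, cls, subset)
            dataset[cls][subset] = [os.path.join(folder, f)
                                    for f in sorted(os.listdir(folder))]
    return dataset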

Results with 5 classes

ORB Features

k = 32

Time for getting all the local descriptors of the training images was 00:00:02.476804.
Time for generating the codebook with k-means was 00:00:04.964310.
Time for getting VLAD global descriptors of the training images was 00:00:12.084776.
Time for calculating the SVM was 00:00:23.850120.
Time for getting VLAD global descriptors of the testing images was 00:00:06.300698.
Elapsed time predicting the testing set is 00:00:00.034392
Accuracy = 64.0.
Classes = ['cassava', 'pinto bean', 'tomato', 'wheat', 'potato']
Classes Local Descriptors Counts = [24507, 25720, 24192, 25232, 22500]
Confusion Matrix =
 [20  2  2  4  2]
 [ 3 17  1  7  2]
 [ 5  3 14  4  4]
 [ 2  0  1 26  1]
 [ 5  2  2  2 19]

k = 64


Time for generating the codebook with k-means was 00:00:09.258671.
Time for getting VLAD global descriptors of the training images was 00:00:17.179350.
Time for calculating the SVM was 00:00:46.991844.
Time for getting VLAD global descriptors of the testing images was 00:00:08.864136.
Elapsed time predicting the testing set is 00:00:00.065617
Accuracy = 65.3333333333.
Confusion Matrix =
 [19  3  6  1  1]
 [ 4 17  2  5  2]
 [ 4  3 15  6  2]
 [ 1  1  0 28  0]
 [ 4  3  3  1 19]

k = 128


Time for generating the codebook with k-means was 00:00:17.634565.
Time for getting VLAD global descriptors of the training images was 00:00:27.555537.
Time for calculating the SVM was 00:01:34.641161.
Time for getting VLAD global descriptors of the testing images was 00:00:14.336687.
Elapsed time predicting the testing set is 00:00:00.183201
Accuracy = 64.0.
Confusion Matrix =
 [21  2  3  1  3]
 [ 5 13  3  7  2]
 [ 4  4 14  5  3]
 [ 3  0  0 27  0]
 [ 2  4  2  1 21]

k = 256

Time for generating the codebook with k-means was 00:00:35.860902.
Time for getting VLAD global descriptors of the training images was 00:00:52.757244.
Time for calculating the SVM was 00:03:32.905707.
Time for getting VLAD global descriptors of the testing images was 00:00:26.958474.
Elapsed time predicting the testing set is 00:00:00.383407
Accuracy = 62.6666666667.
Confusion Matrix =
 [19  4  3  2  2]
 [ 5 15  2  6  2]
 [ 2  2 16  6  4]
 [ 2  0  0 28  0]
 [ 6  3  3  2 16]

SIFT Features

k = 32

Time for getting all the local descriptors of the training images was 00:00:31.042207.
Time for generating the codebook with k-means was 00:00:28.069662.
Time for getting VLAD global descriptors of the training images was 00:01:56.548077.
Time for calculating the SVM was 00:01:23.322534.
Time for getting VLAD global descriptors of the testing images was 00:00:57.089969.
Elapsed time predicting the testing set is 00:00:00.142989
Accuracy = 79.3333333333.
Classes = ['cassava', 'pinto bean', 'tomato', 'wheat', 'potato']
Classes Local Descriptors Counts = [73995, 79464, 38025, 86212, 28823]
Confusion Matrix =
 [22  2  3  0  3]
 [ 0 24  2  1  3]
 [ 0  1 23  3  3]
 [ 0  0  0 30  0]
 [ 1  6  1  2 20]

k = 64

Time for generating the codebook with k-means was 00:00:56.495541.
Time for getting VLAD global descriptors of the training images was 00:02:20.884382.
Time for calculating the SVM was 00:03:10.961254.
Time for getting VLAD global descriptors of the testing images was 00:01:07.280931.
Elapsed time predicting the testing set is 00:00:00.338773
Accuracy = 78.6666666667.
Confusion Matrix =
 [23  1  3  0  3]
 [ 1 24  2  1  2]
 [ 0  0 25  2  3]
 [ 0  0  1 29  0]
 [ 3  7  1  2 17]


k = 128

Time for generating the codebook with k-means was 00:48:26.116957.
Time for getting VLAD global descriptors of the training images was 00:03:02.914655.
Time for calculating the SVM was 00:06:52.834055.
Time for getting VLAD global descriptors of the testing images was 01:32:33.834146.
Elapsed time predicting the testing set is 00:00:00.620425
Accuracy = 77.3333333333.
Confusion Matrix =
 [23  1  3  0  3]
 [ 1 23  3  1  2]
 [ 0  1 25  1  3]
 [ 0  0  1 29  0]
 [ 1  9  3  1 16]


k = 256

Time for generating the codebook with k-means was 00:08:52.935980.
Time for getting VLAD global descriptors of the training images was 00:10:37.489346.
Time for calculating the SVM was 03:08:49.511758.
Time for getting VLAD global descriptors of the testing images was 00:05:28.551197.
Elapsed time predicting the testing set is 00:00:02.722835
Accuracy = 79.3333333333.
Confusion Matrix =
 [25  2  1  0  2]
 [ 0 25  2  1  2]
 [ 1  0 22  3  4]
 [ 0  0  0 30  0]
 [ 2  7  2  2 17]

Conclusion

I added more classes to the dataset to see how the classification performance would change. One of the important things I realized is that it is necessary to have a similar number of descriptors for each class if you are going to use K-Means (in our case, to build the codebook for the Bag of Words technique). The best result went from 89.77% accuracy to 79.33%, but with a different number of testing images in each case. It is also necessary to always use the same number of testing images, so that comparisons between performances are fairer.

Confusion matrix for the best result obtained (SIFT k=32)
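For context, this is roughly what the codebook and VLAD steps of the pipeline look like. This is a minimal sketch with scikit-learn and random stand-in descriptors; the real code uses ORB/SIFT descriptors and the project's own implementation, which may differ in details such as normalization:

import numpy as np
from sklearn.cluster import KMeans

def vlad(descriptors, kmeans):
    """VLAD: accumulate residuals between descriptors and their nearest center."""
    centers = kmeans.cluster_centers_
    v = np.zeros(centers.shape)
    for desc, c in zip(descriptors, kmeans.predict(descriptors)):
        v[c] += desc - centers[c]
    v = v.flatten()
    return v / np.linalg.norm(v)  # L2 normalization

rng = np.random.RandomState(0)
training_descriptors = rng.rand(1000, 32)  # stand-in for ORB local descriptors
image_descriptors = rng.rand(200, 32)      # descriptors of one image

# Codebook: k centers clustered from all training descriptors.
kmeans = KMeans(n_clusters=32).fit(training_descriptors)
global_descriptor = vlad(image_descriptors, kmeans)  # one vector per image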


Looking in the Android Play Store, I found applications that claim to be able to recognize 20,000 classes of plants. Using my method it would be very difficult to handle that many classes.

First, it would need a lot of memory to store all the descriptors for each class. With around 50 training images per class there would be about 20,000 descriptors per class, which makes 400M (million) vectors in total. Each vector uses at least 32 floating-point numbers (of 32 bits = 4 bytes each). So the data used for all the descriptors would be 32 x 4 bytes x 400M = 51,200M bytes = 51.2 x 10^9 bytes ≈ 47.7 GB of RAM.
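The estimate can be checked in a couple of lines:

# Back-of-the-envelope memory estimate for 20,000 classes.
classes = 20000
descriptors_per_class = 20000      # ~50 images x ~400 descriptors each
bytes_per_vector = 32 * 4          # 32 float32 components
total = classes * descriptors_per_class * bytes_per_vector
print(total / 2.0**30, 'GiB')      # ~47.7 GiB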

Second, it would take a lot of time to run K-Means on 400M vectors and to train an SVM for 20,000 classes using 20,000 x 50 (number of training images) VLAD vectors.

Third, adding only two more classes already caused a drop of 10% in accuracy. Having more classes would give the codewords less discriminating power, because there would be many more features, and they might be more similar to each other than with fewer classes.

When I looked at the reviews of that application, many people complained that the recognition didn't work; someone said that even one of the most common plants wasn't recognized. It may seem too difficult to support that many possible classes, but companies like Google have achieved very good performance with image recognition; for example, in the Google Photos app one can search images using many categories like dog, beach, waterfall, food, etc. So having robots that help in farming is possible, and that could help feed the growing population.