Handwriting Recognition (MNIST)
Assoc. Prof. Wiroon Sriborrirux, Founder of Advance Innovation Center (AIC) and Bangsaen Design House (BDH), Electrical Engineering Department, Faculty of Engineering, Burapha University
This section explains what this document will and will not cover. Artificial intelligence, machine learning, supervised learning, and neural networks are each very large topics; covering any of them fully would take a book. This document therefore only covers the parts needed to load an MNIST handwriting recognition model on RT-Thread.
Of course, I will also give references at the end of each part. References matter for two reasons: they supplement the parts I do not introduce, and they provide support for my claims, since not every document on the Internet is error-free. For example, if some of the formulas and conclusions I list seem abrupt, you can find more detailed derivations and proofs in the references.
This document may still be very long, because machine learning is not pure software development: simply calling library APIs requires some theoretical grounding. Without any theory, you may not know why a model is designed the way it is, or how to improve it when something goes wrong. On the other hand, a long document, especially a formula-heavy theoretical part, can be hard to read patiently. Machine learning does demand some theoretical foundation and programming skill. I believe you will still gain a lot if you keep reading, and I will do my best to introduce both the theory and the application clearly.
The next document is basically pure practical application, without much theoretical content: training an object detection model using the Darknet machine learning framework.
If you are familiar with the theory of machine learning, you can go directly to the second part, Keras training model.
If you are familiar with the Keras machine learning framework, you can jump directly to the third part, RT-Thread loading onnx model.
If you are familiar with RT-Thread and onnx models, then we can discuss how to efficiently implement machine learning algorithms on embedded devices.
This article assumes that everyone can use RT-Thread's env tool to download software packages, generate a project, and upload firmware to an stm32 board. After all, this article focuses on loading the onnx general machine learning model. You can find tutorials about RT-Thread on the official website.
First, let me briefly introduce the scope of each topic mentioned above. Artificial Intelligence is the largest topic. If we use a picture to illustrate:
Then Machine Learning is the topic of this document, but Machine Learning is still a very large topic:
Here is a brief introduction to the three types mentioned above:
Supervised Learning : This is probably the most widely used field. For example, in face recognition, I will give you a large number of pictures in advance, and then tell you which ones contain faces and which do not. You summarize the features of faces from the pictures I give you. This is the training process. Finally, I will provide some pictures that have never been seen before. If the algorithm is well trained, it can distinguish whether a picture contains a face. Therefore, the biggest feature of supervised learning is that there is a training set to tell the model what is right and what is wrong.
Unsupervised Learning : For example, in an online shopping recommendation system, the model will classify my browsing history and automatically recommend related products to me. The biggest feature of unsupervised learning is that there is no standard answer. For example, a water cup can be classified as a daily necessity or a gift.
Reinforcement Learning : Reinforcement learning is probably the most attractive part of machine learning. For example, there are many examples on Gym where computers are trained to play games and get high scores. Reinforcement learning is mainly about finding the method that can maximize your benefits through trial and error (Action), which is why many examples are about computers playing games.
So the rest of the document is about supervised learning , because handwriting recognition requires some training sets to tell me what numbers these images should actually be. However, there are many supervised learning methods, mainly classification and regression:
Classification: For example, handwriting recognition. The characteristic of this type of problem is that the final result is discrete. The final classified numbers can only be 0, 1, 2, 3 but not decimals such as 1.414 and 1.732.
Regression: For example, in the classic case of house price prediction, the results of this type of problem are continuous. For example, house prices will change continuously and there are infinite possibilities, unlike handwriting recognition which only has 10 categories from 0 to 9.
In this way, the handwriting recognition introduced next is a classification problem. However, there are many classification algorithms; this article introduces the neural network, which has a wide range of applications and is relatively mature.
Artificial Neural Network : This is a relatively general method that can be applied to data fitting in various fields, but images and speech also have their own more suitable algorithms.
Convolutional Neural Network : Mainly used in the image field, which will be introduced in detail later.
Recurrent Neural Network : It is more suitable for sequence inputs such as sound, so it is widely used in the field of language recognition.
To sum up, this document introduces machine learning, the rapidly developing branch of artificial intelligence, and then solves a classification problem under supervised learning using the convolutional neural network (CNN) method from the family of neural networks.
This section mainly introduces the entire operation process of the neural network, how to prepare the training set, what is training, why to train, how to train, and what to get after training.
To do machine learning training and prediction, we first need to know what the model we are training is like. Let’s take the most classic linear regression model as an example. The artificial neural network (ANN) behind it can actually be seen as a combination of multiple linear regressions. So what is a linear regression model?
For example, for the scattered points in the figure below, we hope to find a straight line to fit. The linear regression fitting model is:
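$$y = kx + b$$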
In this way, if there is a point x = 3 in the future that is not in the area covered by these points on the graph, we can also predict the corresponding y through the trained linear regression model.
However, the above formula is usually expressed in another way. The final predicted value, y, is usually expressed as hθ (hypothesis), and its subscript θ represents different training parameters, i.e. k and b. The model becomes:
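$$h_\theta(x) = \theta_0 + \theta_1 x$$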
So θ0 corresponds to b, and θ1 corresponds to k. However, this representation model is not general enough. For example, x may not be a one-dimensional vector. For example, in the classic house price prediction, we need to know the house price, which may require many factors such as the size of the house and the number of rooms. Therefore, the above is represented in a more general way:
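$$h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n \qquad (x_0 = 1)$$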
This is the linear regression model. As long as you know vector multiplication, the above formula is easy to calculate.
By the way, θ needs a transpose, θᵀ, because we are usually used to working with column vectors. The formula above is actually the same as y = kx + b, just written differently. However, this expression is more general, and it is more concise and elegant:
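$$h_\theta(x) = \theta^T x$$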
In order to make the above model fit these scattered points well, our goal is to change the model parameters θ0 and θ1, that is, the slope and intercept of this line, so that it can reflect the trend of the scattered points well. The following animation intuitively reflects the training process.
It can be seen that it is an almost horizontal straight line at the beginning, but slowly its slope and intercept move to a better position. So the question is, how do we evaluate whether the current position of this line meets our needs?
A very direct idea is to sum the squared differences between the actual value y of all the scattered points and the predicted value hθ of our model. This evaluation index is called the loss function (cost function) J(θ):
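$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Here m is the number of training samples, and (x⁽ⁱ⁾, y⁽ⁱ⁾) is the i-th sample.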
The reason the right side of the function is divided by 2 is to make the derivative cleaner: when the formula on the right is differentiated, the square brings down a 2, which exactly cancels the 2 in the denominator.
Now we have an evaluation indicator. The smaller the value calculated by the loss function, the better. This way we know whether the current model can meet the needs well. The next step is to tell the model how to optimize in a better direction. This is the training process.
In order to make the model parameter θ move in a better direction, it is natural to go downhill. For example, the loss function above is actually a parabola; as long as we keep going downhill, we can always reach the lowest point of the function:
So what is the direction of "downhill"? In fact, it is the direction of the derivative. As can be seen from the animation above, the black dot has been gradually moving along the tangent direction to the lowest point. If we take the derivative of the loss function, that is, the derivative of J(θ):
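$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$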
Now we know which direction θ should move, but how far should it move each time? As shown in the animation above, even if the black dot knows the direction of movement, it still needs to determine how much it moves each time. This amount of movement is called the learning rate α, which allows us to know in which direction and how much the parameter should move each time:
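$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$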
This training method is the famous Gradient Descent method. Of course, there are many improved training methods such as Adam; their principles are similar, so I will not introduce them in detail here.
The process of machine learning can be summarized as follows: we first design a model, then define an evaluation indicator called a loss function, so that we know how to judge the quality of the model. Next, we use a training method to make the model parameters move in a direction that can reduce the loss function. When the loss function almost stops decreasing, we can consider the training to be over. The final training result is the model parameters, and we can use the trained model to predict other data.
By the way, the linear regression above actually has a closed-form theoretical solution: the optimal weights can be obtained in one step, without going through the training process at all. It is called the Normal Equation:
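$$\theta = (X^T X)^{-1} X^T y$$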
So why do we train step by step when there is a theoretical solution that can be computed in one step? Because the formula above contains a matrix inversion. When the matrix is small, inverting it is cheap, but once the matrix becomes large, the inversion becomes computationally prohibitive. That is why training methods such as gradient descent are used to approach the optimal solution step by step.
Let’s go back to the example of handwriting recognition. The linear regression introduced above finally obtains a continuous value, but the final goal of handwriting recognition is to obtain a discrete value, that is, 0-9. So how can this be achieved?
This is the model in the previous part. It is actually very simple. We only need to add a sigmoid function to the final result and limit the final result to 0-1.
The sigmoid function is:
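$$g(z) = \frac{1}{1 + e^{-z}}$$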
If we apply it to the linear regression model, we get a nonlinear regression model, namely Logistic Regression:
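$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$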
This ensures that the final result is between 0 and 1. Then we can define that if the final result is greater than 0.5, it is 1, and if it is less than 0.5, it is 0. In this way, a continuous output is discretized.
Now we have introduced the continuous linear regression model Linear Regression and the discrete nonlinear regression model Logistic Regression. Both models are very simple and only a few centimeters long when written on paper. So how do such simple models combine into a very useful neural network?
In fact, the above model can be regarded as a neural network with only one layer. We input x and get the output hθ after one calculation:
What if we don't get the result so quickly, but insert another layer in the middle? We get a neural network with one hidden layer.
In the above figure, we use a to represent the output of the activation function, which is the sigmoid function mentioned in the previous part. Its job is to limit the output to 0-1; without it, after several layers of neural network calculations the output values could easily explode to very large numbers. Besides the sigmoid function there are many other activation functions, such as Relu, which is very commonly used in the convolutional neural networks of the next part.
In addition, we use bracketed superscripts to represent the number of neural network layers. For example, a(1) represents the output of the first layer of the neural network. Of course, the first layer is the input layer and does not require any calculations, so we can see that a(1)=x in the figure, and the output of the activation function of the first layer is directly our input x. However, θ(1) does not represent the parameters of the first layer, but the parameters between the first and second layers. After all, the parameters exist in the calculation process between the two layers of the network.
So, we can summarize the above neural network structure:
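$$a^{(1)} = x$$

$$z^{(2)} = \theta^{(1)} a^{(1)}, \qquad a^{(2)} = g(z^{(2)})$$

$$z^{(3)} = \theta^{(2)} a^{(2)}, \qquad h_\theta(x) = a^{(3)} = g(z^{(3)})$$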
If we set the final output layer nodes to 10, then they can just be used to represent the 10 numbers 0-9.
If we add a few more hidden layers, doesn’t it look a bit like interconnected neurons?
If we Go Deeper (the authors mentioned in the Inception paper that the inspiration for the name actually came from the movie Inception)
So we get a deep neural network:
If you want to know how many hidden layers you should choose and how many nodes you should choose for each hidden layer, this is the ultimate question of neural networks, just like where you come from and where you are going.
Finally, the training method of the neural network is back propagation. If you are interested, you can find a more detailed introduction here.
Finally, we come to the convolutional neural network that will be used later. From the previous introduction, we can see that the neural network model is actually very simple and does not require much mathematical knowledge. We only need to know matrix multiplication and function derivation. The deep neural network is just repeated matrix multiplication and activation function operations:
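For example, with three weight matrices the whole network is just the composition:

$$h_\theta(x) = g\left(\theta^{(3)} \, g\left(\theta^{(2)} \, g\left(\theta^{(1)} x\right)\right)\right)$$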
Repeating the same operations like this seems a bit monotonous. The convolutional neural network to be introduced below introduces more interesting operations, mainly:
Conv2D
Maxpooling
Relu
Dropout
Flatten
Dense
Softmax
Next, we will introduce these operators one by one.
First of all, the biggest feature of neural networks in the image field is the introduction of convolution operations. Although the name looks a bit mysterious, the convolution operation is actually very simple.
Here is why we need to introduce the convolution operation. Although the matrix multiplication above can solve many problems, once we enter the image field, flattening even a 1920x1080 image produces a [1, 2073600] vector, so the amount of computation in a fully connected layer is far from small; the convolution operation can greatly reduce it. On the other hand, compressing a two-dimensional image into a one-dimensional vector loses the correlation between neighboring pixels, up, down, left and right. For example, the color of a pixel is usually similar to that of the surrounding pixels, and this is often very important image information.
After introducing the advantages of convolution operation, what exactly is convolution operation? In fact, convolution is a simple addition, subtraction, multiplication and division. We need an image and a convolution kernel:
The image above is processed by a 3x3 convolution kernel, which extracts the edges of the image very well. The following animation clearly introduces the matrix operation:
The convolution kernel used in the animation above is a 3x3 matrix:
If we pause the animation:
It can be seen that the convolution operation is actually to scan the convolution kernel on the image in rows and columns, multiply the numbers at the corresponding positions, and then sum them. For example, the convolution result 4 in the upper left corner above is calculated like this (here ∗ represents convolution):
Of course, the calculation process above is not rigorous, but it can conveniently illustrate the calculation process of convolution. It can be seen that the amount of convolution calculation is very small compared to the fully connected neural network, and it retains the correlation of the image in two-dimensional space, so it is widely used in the image field.
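To make this concrete, here is a minimal NumPy sketch of the convolution just described (no padding, square inputs assumed; the function and variable names are my own):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image; at each position, multiply
    element-wise with the covered patch and sum the products."""
    f = kernel.shape[0]
    out = (image.shape[0] - f) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            result[i, j] = np.sum(patch * kernel)
    return result

# A 5x5 image convolved with a 3x3 kernel yields a 3x3 result.
image = np.random.rand(5, 5)
kernel = np.random.rand(3, 3)
print(conv2d(image, kernel).shape)  # (3, 3)
```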
The convolution operation is very useful, but the image shrinks after convolution. For example, the 5x5 matrix above becomes a 3x3 matrix after a 3x3 convolution kernel passes over it. Therefore, sometimes, in order to keep the image size unchanged, the image is padded with 0s around its border. This operation is called padding.
However, padding cannot completely ensure that the image size remains unchanged, because the convolution kernel in the animation above only moves one grid in one direction each time. If it moves 2 grids each time, the 5x5 image will become a 2x2 matrix after the 3x3 convolution. The number of steps the convolution kernel moves each time is called stride .
The following is the formula for calculating the image size after a convolution operation on an image:
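$$W' = \frac{W - F + 2P}{S} + 1$$

Here W is the input width, F the kernel size, P the padding, and S the stride.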
For example, the image width W = 5, the convolution kernel size F = 3, no padding is used so P = 0, and the number of steps per movement S = 1:
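$$W' = \frac{5 - 3 + 2 \times 0}{1} + 1 = 3$$

So the output is a 3x3 matrix, matching what we saw above.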
Here I would like to explain that the above calculations are all for one convolution kernel. In fact, a convolution layer may have multiple convolution kernels, and in fact, many CNN models also have more and more convolution kernels as the number of layers increases.
As mentioned above, convolution can keep the image size unchanged through padding, but many times we hope to gradually reduce the image size as the model progresses, because the final output, such as handwriting recognition, actually only has 10 numbers 0-9, but the image input is 1920x1080, so maxpooling is to reduce the image size.
In fact, this calculation is much simpler than convolution:
For example, the 4x4 input on the left, after 2x2 maxpooling, actually takes the maximum value of the 2x2 block in the upper left corner:
So such a 4x4 matrix is reduced in size by half after 2x2 maxpooling, which is the purpose of maxpooling.
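A minimal NumPy sketch of this operation, in the same style as the convolution sketch above (the names are my own):

```python
import numpy as np

def maxpool2d(image, size=2):
    """Replace each size x size block with its maximum value."""
    h, w = image.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h // size * size, size):
        for j in range(0, w // size * size, size):
            out[i // size, j // size] = np.max(image[i:i+size, j:j+size])
    return out

# A 4x4 matrix shrinks to 2x2 after 2x2 maxpooling.
print(maxpool2d(np.arange(16).reshape(4, 4)))
```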
When introducing the sigmoid function before, it was mentioned that it is a type of activation function, and Relu is another activation function that is more commonly used in the image field. Compared with sigmoid, Relu is very simple:
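$$f(x) = \max(0, x)$$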
In fact, when the number is less than 0, it is set to 0, and when it is greater than 0, it remains unchanged. It's that simple.
So far, we have introduced three operators: conv2d, maxpooling, and relu. The operation of each operator is very simple, but Dropout is even simpler, without any calculation, so there is no formula in this part.
The problem of model overfitting has not been mentioned before, because during the training process of the neural network model, it is very likely that the model fits the training set provided to it very well, but once it encounters data that it has never seen before, it cannot predict the correct result at all. This is when overfitting occurs.
So, how to solve the overfitting problem? Dropout is a very simple and crude method. It randomly picks out some parameters from the trained parameters and resets them to 0. That’s why it is called Dropout. It just randomly drops some parameters.
This is an incredibly simple method, but it works surprisingly well. For example, simply randomly discarding 60% of the trained parameters after maxpooling can solve the overfitting problem very well.
It is still the simple style of convolutional neural network, and there will be no formulas here.
Flatten is just what it literally means, flattening a 2D matrix into a 1D vector, such as this (illustrative) matrix:
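$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}$$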
It's that simple.
Dense has actually been introduced before, which is matrix multiplication and then addition:
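$$a = Wx + b$$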
So the convolution part does not require knowing too much mathematical operations.
This is the last operator. For example, if we want to do handwriting recognition, the final output will be 0-9, which will be a 1x10 matrix, such as the following prediction result (actually one line, for the convenience of display written in two lines):
From the 1x10 matrix above, we can see that the 7th number 0.753 is much larger than the other numbers (the subscript starts at 0), so we know that the current prediction result is 7. Therefore, softmax will output 10 numbers as the output layer of the model, each number represents the probability that the image is 0-9, and we take the largest probability as the prediction result .
On the other hand, the sum of the above 10 numbers is exactly 1, so each number actually represents a probability. The model believes that the probability that this number is 1 is 0.000498, the probability that it is 2 is 0.000027, and so on. Such an intuitive and convenient result is calculated using softmax.
For example, there are two numbers [1, 2] after softmax operation:
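$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} \qquad \Rightarrow \qquad \frac{e^1}{e^1 + e^2} \approx 0.269, \quad \frac{e^2}{e^1 + e^2} \approx 0.731$$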
The final two numbers we get are [0.269, 0.731].
At this point, the first part, the convolutional neural network operators, has finally been fully introduced. The second part will introduce how to actually use the Keras (Tensorflow) machine learning framework to train a handwriting recognition model. Finally, the third part will introduce how to import the generated model into stm32 and run it.
Here we will introduce how to train the Convolutional Neural Network (CNN), which is widely used in the field of images.
This section should not involve a lot of theory. In fact, it is very simple to write code using Keras to train the model. If you find it unclear why the code is written in this way, you can look at the corresponding operators in the previous section.
First we need to introduce the training set. After all, before training we need to see what the training set looks like.
This is the official website of the handwriting recognition database, which has a cross-century style:
This graph is a summary of the accuracy rates of handwriting recognition using different methods from around the world. You can see that the part circled in red shows the worst handwriting recognition results using the Logistic Regression (Linear Classifier) introduced earlier, so what we are going to use next is the Convolutional Neural Network (CNN) (I will use the abbreviation CNN from now on).
The binary format definition of the training set is given below the website:
Of course, that is only if you download the original training set from the website and extract the images yourself; when using tensorflow we don't need to parse the dataset ourselves.
First, let me introduce the development environment for machine learning. The mainstream development environment today is Python, but we will not use a bare Python installation and write code in Notepad. The development environment most commonly used by data scientists is Anaconda, which integrates Python and R development environments.
We download the Anaconda installation package from the official website https://www.anaconda.com/distribution/ and select it according to our operating system. Since the installation process is basically just a simple next step, I will not introduce it here.
After installation, we open the Anaconda Prompt :
Anaconda actually has a graphical interface called Anaconda Navigator, but we mainly use the console here: one line of command solves the problem, which is faster and more convenient than clicking through the graphical interface.
Then we type a command that creates a development environment with TensorFlow and Keras installed (the environment name tensorflow below is just an example):
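```
conda create -n tensorflow tensorflow keras
```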
Now that the development environment is set up, let's activate the current development environment:
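```
conda activate tensorflow
```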
Activating a development environment matters because Anaconda can hold multiple development environments. For example, if you want to compare the computing speed of CPU and GPU, you can install two environments side by side, then switch to the CPU or GPU environment as needed, which is very convenient. With a bare Python instead of Anaconda, you would either use VirtualEnv or repeatedly install and uninstall different development environments.
Next we can start Jupyter Notebook, the place where we write the code:
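```
jupyter notebook
```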
This will automatically open the browser and you will see our development environment. Create a new notebook here:
You can rename it to mnist-keras:
Now you can start training the model.
2.3.1 Importing library functions
We first import the required library functions and write the code in the box after In[1]:
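A plausible set of imports for what follows (the original notebook may differ slightly):

```python
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
```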
This is what the code looks like once entered; I won't screenshot every step from here on.
If you are curious about the comment above the import code, you can move the cursor to an input box, press Esc and then m, and the input box will change from a code cell to a comment (markdown) cell. Jupyter Notebook can save code, comments, and output together, so the experience is very good. More shortcut keys can be found under Help --> Keyboard Shortcuts in the menu bar.
Move the cursor to the code block you just entered and press Shift + Enter to execute it; a new code input box is automatically added below. Importing the libraries may take some time depending on your computer's configuration, so please wait patiently.
2.3.2 Download the MNIST training set
Enter a line of code in the code block:
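With Keras this is one line; it downloads the data on first use (a sketch, matching the imports above):

```python
(X_train, y_train), (X_test, y_test) = mnist.load_data()
```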
This will automatically download the dataset. The download may be slow in China; you can instead download the MNIST dataset from this address and unzip it to the location where Anaconda Prompt started Jupyter Notebook (by default C:/Users/your username/), so you don't have to wait for the slow download.
2.3.3 Take a look at the MNIST data
The downloaded dataset already comes divided into a training set and a test set: the training set is used to train the model, and the test set is used to check the accuracy of the final model's predictions. Before training we also reshape the images and one-hot encode the labels; a sketch of this standard preprocessing (the notebook's exact code may differ):
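```python
from keras.utils import to_categorical

# Add a channel dimension and scale pixel values to the 0-1 range
x_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255
x_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255

# One-hot encode the labels, e.g. 7 -> [0,0,0,0,0,0,0,1,0,0]
y_train_cat = to_categorical(y_train, 10)
y_test_cat = to_categorical(y_test, 10)
```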
If you are curious what these images look like, let's look at the first image in the training set, and then the second:
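```python
# One way to display the images, using the matplotlib imported above
plt.imshow(X_train[0], cmap='gray')
plt.show()
print(y_train[0])  # its label

plt.imshow(X_train[1], cmap='gray')
plt.show()
print(y_train[1])
```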
Now we will start to build the training model.
Also import the Keras library first:
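Plausible imports for the layers used below:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
```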
Next, we can build the model. You can see that the model here is exactly the same as the CNN operator introduced in the previous part, including the familiar conv2d, maxpooling, dropout, flatten, dense, softmax, and adam. If you forget what they mean, you can always switch to the previous part to recall them.
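A sketch of such a model. The layer sizes here (32 and 64 filters, 128 dense units) are my illustrative choices, not necessarily those of the original notebook:

```python
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
```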
This completes the model.
The code really only needs one line per layer, but you must know why your model is built this way: why maxpooling comes after conv2d, why dropout is added, what exactly the final softmax does, and whether it can be omitted.
Let’s take a look at what the model we built looks like:
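```python
model.summary()
```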
It can be seen that it is indeed one-to-one corresponding to the theory in the previous part.
Next we can start training the model:
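Using the preprocessed data from earlier (the batch size is my assumption; the 50 epochs match the text below):

```python
history = model.fit(x_train, y_train_cat,
                    batch_size=128,
                    epochs=50,
                    validation_data=(x_test, y_test_cat))
```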
This model is very small, and I trained it on the CPU for only 50 iterations, which took about 10 minutes; so whenever a GPU is available, use it rather than the CPU.
We can review the training process just now by plotting the loss recorded during training:
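```python
# Plot the loss curves recorded by model.fit() above
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='test loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()
```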
It can be seen that the loss computed by the cost function keeps decreasing on both the training set and the test set. What looks amazing is that the model performs even better on the test set than on the training set; this is largely because Dropout is only active during training. However, the accuracy of the model is not very high, just over 60%, and you can try to optimize it yourself. In the process of trying to improve the model you will deepen your understanding of it; if I directly gave you a model with very good performance here, it might not help you as much.
We can save the model as a native Keras model:
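```python
model.save('mnist.h5')
```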
Of course, in order to load it on stm32, we would rather save it in the format of the general machine learning model onnx:
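One plausible route is the keras2onnx converter package:

```python
import onnx
import keras2onnx

onnx_model = keras2onnx.convert_keras(model, model.name)
onnx.save_model(onnx_model, 'mnist.onnx')
```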
In this way, you will see two files, mnist.h5 and mnist.onnx, in the default directory of Anaconda Prompt (C:/Users/your username). These are the trained models.
Now our model has been trained and saved. The next step is how to use the trained model.
(You can try changing the Dropout probability from 0.5 to 0.3, the accuracy of the training set will increase from 60% to 80%, and the test set will be more than 90%. Why?)
This section will introduce how to use the model after it is trained, that is, the inference process of the model.
Let's first load the model with Python to see if we can make good predictions with the model we just trained. The following code imports the mnist.onnx model that we just trained and saved.
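A sketch using the onnxruntime package:

```python
import onnxruntime as rt

sess = rt.InferenceSession('mnist.onnx')
```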
In order to run the model, we need to get the output and input layers of the model first. The output layer is mentioned in the previous part, which should be softmax:
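```python
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name
```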
The next step is to use the test set to predict the model:
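```python
# Run the first test image through the model
x = x_test[0:1]  # shape (1, 28, 28, 1), float32, from the preprocessing above
pred = sess.run([output_name], {input_name: x})[0]
print(pred)           # ten softmax probabilities
print(pred.argmax())  # the predicted digit
```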
We can look at the numbers for the model's test set and then see what the model calculated:
We can see that the last softmax layer of the model outputs 10 numbers, among which the 7th number 0.99688894 (the subscript starts from 0) is obviously much larger than the other numbers, which means that the probability that the number in this picture is 7 is more than 99%, and this picture is indeed 7.
It seems that the model just trained can still make predictions normally. Of course, it cannot guarantee 100% accuracy. If you are interested, you can also change the sequence number of X_test[0] in the above code to see how the prediction effect of other test sets is.
At this point, we no longer need to write any more Python code in Anaconda's Jupyter Notebook. The complete code can be seen here:
https://github.com/wuhanstudio/onnx-backend/blob/master/examples/model/mnist-keras.ipynb
3.2.1 Introduction to Protobuf
From here on, our goal is to load the trained onnx model on stm32, so why is Google Protobuf suddenly mentioned here? Because the onnx model structure is saved in the Google Protobuf format.
As we mentioned before, the purpose of model training is to get the weights of variables, which are just pure numbers. However, we cannot just write these numbers into files one by one, because in the model file to be saved, we need to save not only the weights, but also tell the people who will use the model in the future what the model structure is like, so we need to reasonably design the format of the saved file. Different machine learning frameworks have their own model saving formats. For example, the model format of Keras is h5, while the saving format of Tensorflow and onnx is protobuf.
So what exactly is protobuf? Why is protobuf so popular?
In fact, protobuf is very simple and convenient to use. You first define a data storage format, then use protoc to automatically generate parsing code for various languages. It currently supports C, C++, C#, Java, Javascript, Objective-C, PHP, Python, and Ruby.
For example, we create a file called amessage.proto
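Its contents might look like this (the message name AMessage is an arbitrary choice for illustration):

```protobuf
syntax = "proto3";

message AMessage {
    int32 a = 1;
    int32 b = 2;
}
```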
This defines a binary data storage format containing two numbers. Here a = 1 means that the field with id 1 is of type int32 and is named a; it does not mean that the variable a has the value 1. Similarly, the field with id 2 is of type int32 and is named b. The ids must not repeat.
Therefore, to use protobuf you first define a data format, then automatically generate encoding and decoding code for different languages. Because it can generate code automatically, protobuf is simple to use and very popular. It is recommended that you use proto3.
There is also a protobuf library in RT-Thread, which can help us parse and save protobuf files in C language. After all, the onnx model we are going to parse later is saved in the protobuf protocol.
protobuf package address: http://packages.rt-thread.org/itemDetail.html?package=protobuf-c
Although it is assumed at the beginning of the article that everyone is familiar with the RT-Thread software package, I would like to remind you to run pkgs --upgrade before menuconfig to see the latest software package.
You can see that there are two routines in this package. One directly creates data in protobuf format and then decodes it directly; the other routine first encodes the data and saves it to a file, then reads the data from the binary file and decodes it.
I won’t go into more details about protobuf here, because the onnx model format has been defined and we just need to use it directly.
Now that we have the protobuf package supported by RT-Thread, the next step is to figure out how the format of the onnx model is defined. The complete definition of the onnx data format can be seen here.
onnx data format definition: https://github.com/onnx/onnx/blob/master/onnx/onnx.proto3
In order to see the model structure more intuitively, here is a tool, Protocol Buffer Editor, that can conveniently parse protobuf files.
After downloading the software, you can parse the mnist.onnx file we produced earlier by following the process below. The onnx.proto3 file mentioned in the figure above can be downloaded here.
Then we can see what data is in the previously trained model in the pop-up interface.
You can see that it contains the model version information, model structure, model weights, model input and model output, which is the information we need.
The weights you see here may not be exactly the same as mine, because each person's trained model is slightly different.
After introducing the basic theory of neural networks, how to train the MNIST handwriting recognition model with Python, and the protobuf file format of the onnx model, we have finally reached the last step: loading the model on the stm32 and running it.
At this point you should be ready to:
The trained model mnist.onnx
An STM32 development board with an SD card. After all, we need to save the model to it before loading it.
Project source code:
RT-Thread loading model: https://github.com/wuhanstudio/onnx-backend
Experience it directly on your computer: https://github.com/wuhanstudio/onnx-parser
First, we need to select the package through menuconfig in env:
I can't help but remind you again: remember to first run pkgs --upgrade in env:
You can see that there are three examples here, introduced separately below. Before reading the source code analysis, you can also flash the code directly to the board to try it out. But remember to enable the file system and copy the model to the SD card. If you want to get the same output, please use the examples/mnist-sm.onnx model.
3.4.1 Manual model and parameter construction
The first routine builds the model and its parameters by hand, which helps us understand the model structure and where the parameters live. After that, automatically loading the weights and the model structure will feel natural and simple.
Since the model is built manually, we must first know what the model looks like. Here I recommend another onnx model visualization tool, netron. The following figure was generated by netron from the mnist.onnx model we trained earlier, and it is very beautiful:
You can see that our model is roughly this pipeline. The repeated layers in the middle are drawn only once here, but we naturally have to include them when we build the model by hand.
Here is why the Dropout used during training is not seen here: Dropout only serves to prevent overfitting by randomly discarding trained parameters (setting them to 0) during training, so once the model is trained, we no longer need the Dropout operation.
Then, we need to manually build the above model.
The model weights can be seen in the header file mnist.h. In fact, the weights here are what I copied from the Protocol Buffer Editor. The weights of your trained model may not be exactly the same as mine.
The next step is to use these weights for calculations, that is, to bring these weights into the various operations introduced in the theoretical part. Each operator can be seen in the source code directory, and one operator corresponds to one c file:
The code for these operators is easy to understand if you match it against the formulas in the theoretical part, so I will not repeat what each operator means. You can also see in mnist.c that it is just the input image passing through each operator in turn, plus some memory-release operations, until the softmax output is finally obtained. If I hide the memory operations:
It can be seen that these operations correspond one-to-one to the model in the previous picture. Therefore, after understanding why the theoretical model is established in this way, it will be a sudden enlightenment when looking at the code. However, compared with Python, C needs to manually save the weights and inputs into arrays and reasonably manage the allocation and release of memory.
If we compile mnist.c and upload it to the board, we can see that the prediction results are successfully output:
Since this model is built completely manually, the memory consumption is very small, about 16KB. The following example needs to load the model from the file system, so the memory consumption will be much larger.
It should be noted that you may have heard that machine learning models need to be quantized to run on an MCU. For convenience, no quantization is done here, so the computation is in floating point, which is slower than a quantized model would be. But because this model is relatively small, the result still appears almost instantly.
You can read more about model quantization here.
3.4.2 Manually build model and automatically load parameters
Previously, we built the model manually and copied the weights from the Protocol Buffer Editor into mnist.h by hand, which was very laborious. This example instead loads the weights automatically, based on the name of the layer currently being computed.
For example, we can see in the Protocol Buffer Editor software:
If we want to calculate the model of the layer "dense_5", then we need the weight W1, and then we will find the corresponding weight according to the name "W1":
So this routine only adds the ability to find weights automatically; the parameters we pass in are just the names of the model's layers. If we strip out the memory-release code, the calculation of each layer is still very clear.
If you are unfamiliar with the names of the above operators, you can recall the theoretical introduction in the first part.
3.4.3 Automatically build the model and load parameters
These three routines become simpler one after another. You can see that the last routine consists of just two lines of code: load the model, then run the model.
You only need to specify the input of the model. After all, the input and output of each layer of the model can be calculated automatically.
This example uses valgrind to test and finds that it requires about 64KB of memory, so everyone should remember to check whether their development board has enough memory.
There is one last point that has not been mentioned. For images, the data layout matters: NHWC and NCHW are slightly different. N is the number of input images, H the image height, W the image width, and C the number of channels (a color image has three channels, RGB). So the difference between NHWC and NCHW is simply whether channel C comes last or right after N.
There is a paper that researched this question. On CPU and GPU it is usually more efficient to choose NCHW, which is why most machine learning frameworks default to the NCHW format; however, on MCUs such as the Cortex-M series, NHWC is more efficient.
Paper address:
https://arxiv.org/abs/1801.06601
Before I knew it, this document had grown this long; I wonder if you had the patience to read to the end. I believe you can still gain a lot if you read it calmly. Here is a summary of what this document introduced:
Classification of Machine Learning Algorithms
Linear Regression (loss function, gradient descent)
Logistic Regression (sigmoid function)
ANN (Back Propagation)
CNN (conv2d, maxpooling, relu, dropout, flatten, dense, softmax)
Protobuf (RT-Thread package protobuf-c)
onnx model structure (RT-Thread package onnx-parser)
RT-Thread loads the onnx model and runs it (RT-Thread package onnx-backend)
The theoretical part basically ends here. The next document is mainly about using the darknet framework, and you won't even need to write any code. Finally, if you are interested in running machine learning models on an MCU, I hope this document can be of help.