PROPOSED METHODOLOGY

\section{Introduction}

This chapter outlines the methodology adopted in this research for constructing an enhanced instance segmentation algorithm and an improved calorie estimation algorithm from a single RGB food image. To meet the research objectives, Section 3.3 presents the enhanced recognition and segmentation method, and Section 3.4 presents the improved estimation method, which aims at a more accurate calorie estimate for food.

\section{Research methodology outline}

The improved food calorie estimation depends on two factors: food segmentation and food calorie estimation.

Food segmentation includes four stages, as illustrated in Figure \ref{fig1}.

The first stage identifies the food segmentation method. It reviews and compares different food segmentation methods to select a segmentation algorithm suitable for food and to determine the research gaps between them. The second stage establishes deep learning as an appropriate approach to food segmentation. However, deep learning still suffers from several problems, including the inability to deal with foods of diverse shapes, colours, and sizes. Accordingly, there is a need for enhanced segmentation based on deep learning algorithms for multiple types of food.

The third stage presents an overview of the enhanced deep learning component, which consists of two main parts: developing the backbone and developing the layer function.

The final stage evaluates the performance of the proposed solution in three steps. The first step prepares the food dataset, including labelling and training. The second step evaluates the proposed backbone and function layer against the existing algorithm to check that they satisfy the performance requirement. The third step evaluates the performance of food instance segmentation against the existing algorithm.

Food calorie estimation also includes four stages, as illustrated in Figure \ref{fig1}.

The first stage identifies the food calorie estimation method. It reviews and compares different food calorie estimation methods to select a suitable one and to determine the research gaps between them. The second stage establishes the single-image approach as suitable for food calorie estimation: a single RGB image is taken from any suitable camera position, without reference objects and without depth-sensing cameras, and the food may have an irregular shape.

The third stage presents an overview of the improved 3D construction algorithm, whose enhanced component consists of two main parts: developing the encoder-decoder algorithm and combining the point cloud and finite element algorithms.

The final stage evaluates the performance of the proposed solution in two steps. The first step prepares the food calorie dataset, including labelling and training. The second step evaluates the proposed encoder-decoder algorithm and the food calorie estimation results against other existing approaches to check that they satisfy the performance requirement.

%The research methodology in constructing the enhanced instance segmentation and improved 3D construction algorithms based on a single RGB image for better food calories estimation, as illustrated in Figure ref{fig1}, shows the main steps towards enhanced the multiple type of food instance segmentation for classifying and localising the segmentation of each food item of the image and the estimation of calories for each image.

%In the first stage, the architecture of multi-food instance segmentation is proposed through two steps. Firstly, the extraction of low and high levels of features through the proposed backbone. The second step is to develop a new technique called the Max RoI layer, which managed various features map to obtain the best boundaries for the multiple types of food image.

%During the next stage ( improved calorie estimation based on a single RGB image), an improved CNN model is used to transform multiple foods to estimate the depth based on the notion of our encoder-decoder algorithm. To distinguish the level of class and achieve the accurate estimation of calories of food.

% Once determining the input image’s depth, point cloud, and convex hull algorithms to reconstruct the 3D shape.

%After that, the finite element analysis algorithm is conducted to find the irregular and regular volume of multiple food types.

%The literature shows that food calories estimation of multiple types is considered an open research issue in the computer vision field.

%This is due to many challenges, such as different ingredients, sizes, shapes, cooking methods, and duplication.

%The performance and effectiveness of food calories estimation of multiple types depend on two factors: food segmentation and food volume estimation.

%Food segmentation problems can be successfully solved by using deep learning algorithms. This is due to the promising results obtained when used citep{food4,food5,food6,food7}, compared to the algorithms, the existing segmentation algorithms based on traditional algorithms citep{seg3, seg4, seg9, seg8} had a low-performance rate. Despite the use of deep learning in solving the problem of instance segmentation food, but still suffers from several issues, including the inability to deal with foods with a diversity of shapes, colours, and sizes citep{food4,food5,food6,food7}. Accordingly, there is a need for enhanced segmentation based on deep learning algorithms for multiple types of foods.

%the architecture of enhanced multiple types of food instance segmentation is proposed through two steps. Firstly, an enhanced backbone for recognising and identifying food based on CNN algorithms by extracting low and high characteristic levels from the given images, through enhancement on ResNet as the fundamental building block and connected it with the squeeze-and-excitation (SENet) architecture.

%The second step is to enhance the instance segmentation method by adding a new layer, known as the max RoI layer, to achieve the optimal boundaries for multiple food types by annulling the quantization and utilized max pooling to extract sharp and smooth features.

\begin{figure}[H]
\includegraphics[width=16cm,height=20cm]{chfig1.png}
\caption{The research methodology outline.} \label{fig1}
\end{figure}

%The food calories estimation in the literature has several challenges, such as (1) models typically require multiple degrees of human involvement, forcing participants to move, change, and weigh prefabricated food models to suit the images’ food types. (2) The stereo-based solution allows participants to take several food images from different viewing angles. (3) Reference objects are also needed for precise assessment besides food types. (5) Depth-sensing cameras are also required to capture food images.

%In this research, improved calories estimation of RGB food image by taking a single RGB image from any suitable camera position, without reference objects, with an irregular shape, and without depth-sensing cameras through improved a depth estimation architecture based on a proposed encoder-decoder algorithm to handle the depth estimation from a single RGB image.

%The depth estimation problems from a single RGB image can be successfully solved using deep learning algorithms. This is due to the promising results obtained citep{d8, d21, d22,36}, on the contrary to the traditional algorithms citep{d4,d5,d6,d7} required specific environmental assumptions. Besides, there are issues related to finding the depth of images, such as determining the depth’s heterogeneity. Despite the use of deep learning in solving the problem of the depth estimation problems from a single image, it still suffers from several problems, including inference problem and needing a long computation time citep{d51,d14,d21,d22}. Accordingly, there is a need for improved depth estimation problems from a single image based on deep learning algorithms.

%Besides, enhanced reconstruct the 3D shape and the irregular and regular volume of food by combined point cloud and finite element analysis algorithm for its ability to reconstruct the 3D shape and handle some foods of irregular shape, as shown in Figure ref{fig1}.

\section{Enhancing instance segmentation algorithm for multiple types of food}

This section describes instance segmentation for multiple types of food. Identifying the food items in an input image and segmenting each of them are challenging tasks, mainly because an image may contain many items that differ in colour and size. This study proposes an algorithm that enhances several components, including the backbone used for recognising and identifying food types and the segmentation algorithm (see Figure \ref{figalg}), to obtain a more accurate representation of images containing multiple types of food.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth,height=18cm]{part1.png}
\caption{Enhanced instance segmentation algorithm for multiple types of food.}\label{figalg}
\end{figure}

The overall loss function of the multiple types of food instance segmentation for each food item is represented by the following formula:

\begin{equation}\label{eq1}
L = L_{recognition} + L_{localization} + L_{segmentation}
\end{equation}

where the $L_{recognition}$ and $L_{localization}$ loss functions are given by formula (\ref{2}):

\begin{equation}\label{2}
L(\{p_i\},\{t_i\}) = \frac{1}{N}\sum_{i} |p_i^{\ast} - p_i|^{2} + \lambda \frac{1}{N}\sum_{i} p_i^{\ast} L_{loc}(t_i, t_i^{\ast})
\end{equation}

where $p_i$ is the predicted probability of anchor $i$, $p_i^{\ast}$ is the ground truth of anchor $i$, $t_i$ is the vector of predicted coordinates, $t_i^{\ast}$ is the vector of ground-truth coordinates, $N$ is the normalization term, and $\lambda$ is the balancing parameter.

The $L_{segmentation}$ loss function is defined as the average binary cross-entropy over a per-pixel sigmoid output, which generates the boundaries for each class, as shown in formula (\ref{eq2}):

\begin{equation}\label{eq2}
L_{segmentation} = -\frac{1}{s^2} \sum_{1 \leq i,j \leq s} \left[ y_{ij} \log y_{ij}^{k} + (1 - y_{ij}) \log (1 - y_{ij}^{k}) \right]
\end{equation}

where $y_{ij}$ is the ground-truth label of cell $(i,j)$ in a region of size $s \times s$, $y_{ij}^{k}$ is the predicted value of the same cell, and $k$ is the ground-truth class.
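Formula (\ref{eq2}) can be illustrated with a minimal Python sketch. The function name \texttt{mask\_bce\_loss}, the clamping constant \texttt{eps}, and the toy $2 \times 2$ masks are assumptions made for illustration only; they are not part of the proposed implementation.

```python
import math


def mask_bce_loss(y_true, y_pred, eps=1e-7):
    """Average binary cross-entropy over an s x s mask.

    y_true: s x s list of 0/1 ground-truth labels.
    y_pred: s x s list of sigmoid outputs in (0, 1) for the ground-truth class k.
    eps:    hypothetical clamping constant that keeps log() away from zero.
    """
    s = len(y_true)
    total = 0.0
    for i in range(s):
        for j in range(s):
            p = min(max(y_pred[i][j], eps), 1.0 - eps)
            y = y_true[i][j]
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / (s * s)


# Toy example: an uninformative prediction versus a confident, correct one.
truth = [[1, 0], [0, 1]]
loss_uncertain = mask_bce_loss(truth, [[0.5, 0.5], [0.5, 0.5]])
loss_confident = mask_bce_loss(truth, [[0.9, 0.1], [0.1, 0.9]])
print(loss_uncertain, loss_confident)
```

In this toy case the confident, correct prediction yields the smaller loss, which is the behaviour expected from the cross-entropy term.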

\subsection{Enhancing the ResNet backbone architecture for better recognition and identification of multiple types of food}

%ResNet cite{res} is improved in the proposed backbone, where each ResNet block is fed forward directly linked to all other layers. ResNet efficiency is obtained across the deep neural network, as shown in Figure ref{fig:ResNet backboned}, which consists of a series of blocks to overcome the vanishing of gradient cite{res}.

ResNet has attracted significant interest because of its ability to handle complex data and its high accuracy compared with other backbones. It consists of a series of blocks designed to overcome the vanishing gradient problem \citep{34}. However, the ResNet backbone has several problems: (1) identifying the ResNet blocks that have received insufficient training, (2) identifying the ResNet blocks that have received more training than necessary, and (3) the large filter size adopted in the first convolution layer. In this research, ResNet is improved to form the proposed backbone. The fundamental building block follows new rules for the ResNet parameters, which are derived from the new gradient formula. Layers that require further training are trained more, while the training frequency is decreased for layers that do not need it, according to the formulas presented in this section.

To incorporate a content-aware mechanism that weights each channel adaptively, the output of each proposed ResNet block is passed through the SENet network, which provides a scalar indicating how relevant each channel of the block is. The feature map is aggregated as follows:

\begin{equation} F_{\text{feature map}} = \sum_{i=1}^{W} \sum_{j=1}^{H} y_q(i,j) \end{equation}

where $y_q(i,j)$ is the element at position $(i,j)$ of the feature map $y_q$ with spatial dimension $H \times W$, $H$ is the height, and $W$ is the width.

The proposed backbone stacks five copies of the building block, each consisting of a proposed ResNet block followed by a SENet. The outputs of the five SENets are aggregated so that features from different depth levels are combined, as shown in Figure \ref{fig3}.

\begin{figure}[H]
\centering
\includegraphics[width=15cm,height=18cm]{newbackbonr.png}
\caption{(a) The existing ResNet-101 backbone \citep{34}. (b) The enhanced ResNet backbone.}
\label{fig3}
\end{figure}

\begin{enumerate}

\item Optimization of the ResNet block:

%An analysis of ResNet is performed, followed by the proposal of the network. A novel architecture is proposed by determining a better training for each layer to improve the performance of ResNet and identify a more suitable filter size compared to ResNet citep{34} to extract high and low levels of feature from the input image.

The success of ResNet is attributed to its deep network, which consists of a series of blocks that address gradient problems \citep{34}. The structure of the two-layer block is presented in Figure \ref{fig5}.

\begin{figure}[H]
\centering
\includegraphics[scale=.9]{res.png}
\caption{ResNet block \citep{34}.} \label{fig5}
\end{figure}

The two-layer block is defined by the following formula:

\begin{equation} H(x) = F(x, \{W_i\}) + x \end{equation}

where $x$ is the input of the building block, $H(x)$ is its output vector, and $F(x, \{W_i\})$ is the residual mapping that is learned in the training process.

ResNet prevents the vanishing gradient problem through the identity shortcut, which connects different dimensions. However, the explanation of identity shortcuts in \citep{34} is incomplete: one of the main issues of ResNet is that the activation layer is neglected in the backpropagation process. Specifically, no formula shows how the parameters of ResNet change, which reduces the accuracy of the gradient formula. In addition, the formula used in ResNet does not indicate which layers need more training than others in the training phase. Therefore, ResNet is analysed as a backpropagation network to elaborate on these issues. As shown in the following formula, the total loss function in ResNet is the normalized sum of the squared differences between the predicted output and the ground truth:

\begin{equation} L = \frac{1}{N} \sum_{i=1}^{n} |\hat{y}_i - y_i|^{2} = \frac{1}{N} \sum_{i=1}^{n} |e_i|^{2} \end{equation}

where $n$ is the number of classes, $N$ is the normalization term, $y_i$ is the ground-truth value, and $\hat{y}_i$ is the predicted value.

In forward propagation, the input layer of ResNet is the set $[x_1, x_2, \ldots, x_m]$. The first hidden layer $[s_1^1, s_2^1, \ldots, s_n^1]$ is connected to the input by the weights $\omega_i^1$, and each unit passes through the activation function $\theta(\cdot)$ to give $\theta(s_i^1)$.

The hidden layer with index $conv$ is computed as follows:

\begin{equation}
s_i^{conv} = \sum_{i=1}^{n} \theta(s_i^{conv-1}) \cdot \omega_i^{conv} + \theta(s_i^{conv-1})
\end{equation}

The predicted output of ResNet is obtained from the last layer, and the proposed backpropagation in ResNet is represented by the following formula:

\begin{align}
\frac{\partial L}{\partial \omega_i^{5}} &= \frac{\partial L}{\partial s_i^{5}} \cdot \frac{\partial s_i^{5}}{\partial \omega_i^{5}} \notag \\
&= \delta_i^{5} \cdot \frac{\partial \left( \sum_{i=1}^{n} \theta(s_i^{4}) \cdot \omega_i^{5} \right)}{\partial \omega_i^{5}} \notag \\
&= \delta_i^{5} \cdot \theta(s_i^{4})
\end{align}

where $\delta_i^{5}$ is defined as follows:

\begin{align}
\delta_i^{5} &= \frac{\partial L}{\partial s_i^{5}} = \frac{\partial \frac{1}{N} \sum_{i=1}^{n} |\hat{y}_i - y_i|^{2}}{\partial s_i^{5}} \notag \\
&= \frac{2}{N} \cdot |\hat{y}_i - y_i| \cdot \frac{\partial \hat{y}_i}{\partial s_i^{5}} \notag \\
&= \frac{2}{N} \cdot e_i \cdot \theta'(s_i^{5})
\end{align}

where $\hat{y}_i = \theta(s_i^{5})$ and $e_i$ is the deviation between the predicted output and the actual output of the last layer. The gradient of $L$ with respect to the weight $\omega_i^{4}$ is then calculated as follows:

\begin{align}
\frac{\partial L}{\partial \omega_i^{4}} &= \frac{\partial L}{\partial s_i^{4}} \cdot \frac{\partial s_i^{4}}{\partial \omega_i^{4}} \notag \\
&= \delta_i^{4} \cdot \frac{\partial \left( \sum_{i=1}^{n} \theta(s_i^{3}) \cdot \omega_i^{4} + \theta(s_i^{3}) \right)}{\partial \omega_i^{4}} \notag \\
&= \delta_i^{4} \cdot \theta(s_i^{3})
\end{align}

where $s_i^{4}$ has two parts: the standard part $\sum_{i=1}^{n} \theta(s_i^{3}) \cdot \omega_i^{4}$ and the term $\theta(s_i^{3})$, which is contributed by the identity shortcut connection. The term $\delta_i^{4}$ is given by the following formula:

\begin{align}
\delta_i^{4} &= \frac{\partial L}{\partial s_i^{4}} \notag \\
&= \frac{\partial L}{\partial s_i^{5}} \cdot \frac{\partial s_i^{5}}{\partial s_i^{4}} \notag \\
&= \theta'(s_i^{4}) \sum_{i=1}^{n} \delta_i^{5} \omega_i^{5}
\end{align}

Then, the gradient of $L$ with respect to $\omega_i^{3}$ is calculated as follows:

\begin{equation} \frac{\partial L}{\partial \omega_i^{3}} = \delta_i^{3} \cdot \theta(s_i^{2}) \end{equation}

where $\delta_i^{3}$ is given by the following formula:

\begin{equation} \delta_i^{3} = \theta'(s_i^{3}) \sum_{i=1}^{n} \delta_i^{4} (\omega_i^{4} + 1) \end{equation}

The remaining hidden layers in ResNet follow the same gradient formula:

\begin{equation} \frac{\partial L}{\partial \omega_i^{conv}} = \delta_i^{conv} \cdot \theta(s_i^{conv-1}) \end{equation}

and

\begin{equation} \delta_i^{conv} = \theta'(s_i^{conv}) \sum_{i=1}^{n} \delta_i^{conv+1} (\omega_i^{conv+1} + 1) \end{equation}

where $conv = 2, \ldots, 4$. Finally, the gradient of $L$ with respect to $\omega_i^{1}$ is calculated as follows:

\begin{equation} \frac{\partial L}{\partial \omega_i^{1}} = \delta_i^{1} \cdot x_i \end{equation}

and

\begin{equation} \delta_i^{1} = \theta'(s_i^{1}) \sum_{i=1}^{n} \delta_i^{2} (\omega_i^{2} + 1) \end{equation}

Therefore, the gradient of the connection weights of the five-layer ResNet is represented by the following formula:

\begin{equation} \frac{\partial L}{\partial \omega_i^{conv}} = \delta_i^{conv} \cdot \theta(s_i^{conv-1}), \quad conv = 1, \ldots, 5. \end{equation}

The gradient of the first connection weights fades as the number of layers in the network increases. ResNet uses the identity shortcut connection to address this vanishing gradient. The term $\Delta \delta_i^{conv}$ is the additional gradient that the shortcut contributes at layer $conv$, which counteracts the vanishing gradient in a deep network. It is represented by the following formula:

\begin{equation} \Delta \delta_i^{conv} = \theta'(s_i^{conv}) \sum_{i=1}^{n} \delta_i^{conv+1}, \quad conv = 1, \ldots, 4. \end{equation}

\end{enumerate}
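The effect of the shortcut term on the backpropagated gradient can be illustrated with a small numerical sketch. It reduces each layer to a single unit ($n = 1$), assumes a sigmoid activation for $\theta$, and uses arbitrary pre-activation and weight values; all of these are assumptions made for illustration, not the configuration used in this research. The sketch compares the accumulated factor $\prod \theta'(s)\,\omega$ of a plain network with the factor $\prod \theta'(s)\,(\omega + 1)$ that follows from the recursion for $\delta_i^{conv}$.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def sigmoid_prime(s):
    value = sigmoid(s)
    return value * (1.0 - value)


def backprop_factor(pre_activations, weights, shortcut):
    """Multiply the per-layer factors that scale delta during backpropagation.

    With shortcut=True each layer contributes theta'(s) * (w + 1);
    with shortcut=False it contributes theta'(s) * w (plain network).
    """
    factor = 1.0
    for s, w in zip(pre_activations, weights):
        factor *= sigmoid_prime(s) * (w + (1.0 if shortcut else 0.0))
    return factor


# Hypothetical values: four hidden layers, identical pre-activations and weights.
pre_activations = [0.5, 0.5, 0.5, 0.5]
weights = [0.3, 0.3, 0.3, 0.3]

plain = backprop_factor(pre_activations, weights, shortcut=False)
with_shortcut = backprop_factor(pre_activations, weights, shortcut=True)
print(plain, with_shortcut)
```

Under these assumptions the factor with the shortcut is larger by $((\omega + 1)/\omega)^4$, which shows how the extra term slows the decay of the gradient towards the first layers.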

%subsubsection{ Improvement in the ResNet}

Although the main purpose of the shortcut connections is to solve the vanishing gradient problem, several issues persist in ResNet: some layers receive inadequate reinforcement, other layers receive more reinforcement than required, and a large filter size is selected in the first convolution. In this research, specific strategies are proposed to address insufficient training and to provide a more suitable filter size, as shown in Figure \ref{fig3}.

%begin{figure}[H]

% centering

% subfloat []{{includegraphics[width=7cm]{fig4a.png} }}

% qquad

% subfloat []{{includegraphics[width=7cm]{resnet.png} }}

% caption{ (a) ResNet architecturecitep{34}. (b) Improved ResNet architecture}%

% label{fig6}

%end{figure}

The repetition of blocks is combined with the forward ResNet formula as follows:

\begin{equation} y = \sum_{r=1}^{n} F(x, \{W_{i,r}\}) + x \end{equation}

where $r$ indexes the repetitions of each convolution block of ResNet and the upper limit of the sum is the number of repetitions. For a layer with inadequate training, the number of repetitions should be increased, while for a layer with excessive training it should be reduced. The backpropagation of the improved ResNet is based on formula (\ref{eq10}). The selected filter size is smaller than that of ResNet, which reduces the feature size and the number of parameters during feature extraction. As a result, computational efficiency is enhanced.

\begin{equation}\label{eq10}
\frac{\partial L}{\partial \omega_i^{conv}} = \delta_i^{conv} \cdot \theta(s_i^{conv-1})
\end{equation}
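The forward formula with repetitions can be sketched in a few lines of Python. The two-layer form $F(x, W) = W_2\,\mathrm{ReLU}(W_1 x)$, the helper names, and the toy weights are assumptions for illustration; the sketch only shows how the outputs of the repeated residual mappings are added to the identity shortcut.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(value, 0.0) for value in vector]


def residual_mapping(x, w1, w2):
    """Assumed two-layer residual mapping F(x, {W}) = W2 * relu(W1 * x)."""
    return matvec(w2, relu(matvec(w1, x)))


def repeated_block(x, repetitions):
    """Compute y = sum_r F(x, W_r) + x for a list of (W1, W2) pairs."""
    y = list(x)  # identity shortcut
    for w1, w2 in repetitions:
        branch = residual_mapping(x, w1, w2)
        y = [a + b for a, b in zip(y, branch)]
    return y


# Toy example: two repetitions that each act as the identity on a positive input.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]
y = repeated_block(x, [(identity, identity), (identity, identity)])
print(y)
```

With a single repetition the function reduces to the ordinary block $H(x) = F(x, \{W_i\}) + x$; increasing the length of the list corresponds to increasing the number of repetitions for a layer that needs more training.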

\subsubsection{The architecture of SENet}

The feature maps generated by the enhanced ResNet are fed to the SENet network \citep{squeeze} to obtain further channel information and enhance the sharing of information, as shown in Figure \ref{fig3}. SENet selectively uses global information to emphasise informative features and suppress less valuable ones by applying a weight to each channel of the feature map. It comprises five operations: global average pooling, a fully connected layer, a ReLU function, a second fully connected layer, and a sigmoid function. The sigmoid activation produces channel weights that adapt to the input. The SENet architecture is illustrated in Figure \ref{fig4}.

\begin{figure}[H]
\centering
\includegraphics[width=6cm,height=8cm]{SEB.png}
\caption{The SENet architecture \citep{squeeze}.}
\label{fig4}
\end{figure}

As represented in Figure \ref{fig4}, the SENet architecture mainly consists of two processes:

\begin{itemize}

\item The squeezing process: produces channel-wise statistics ($Se \in R^{D}$) through global average pooling, which can be written as:

\begin{equation} Se_q = F_{SENet}(y_q) = \frac{1}{H \times W} \sum_{i=1}^{W} \sum_{j=1}^{H} y_q(i,j) \end{equation}

where $F_{SENet}(\cdot)$ is the squeezing function, $y_q$ is the $q^{th}$ feature map with spatial dimension $H \times W$, $Se_q$ is the $q^{th}$ element of $Se$, and $q = 1, 2, \ldots, D$.

\item The excitation process: identifies channel-wise dependencies while keeping the number of parameters small, using fully connected layers together with the ReLU and sigmoid functions, as in the following formula:

\begin{equation}
T = F_{excitation}(Se, W) = \sigma(G(Se, W)) = \sigma(W_2 \delta(W_1 Se))
\end{equation}

where $T = \{t_1, t_2, \ldots, t_D\}$, $F_{excitation}$ is the excitation function, and $t_q \in R^{H \times W}$. Here $\delta(x) = \max(x, 0)$ is the ReLU function, $G(\cdot,\cdot)$ is the global function, and $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function. The output is then rescaled as follows:

\begin{equation}
\hat{P} = F_{scale}(Se_q, y_q) = Se_q \cdot y_q
\end{equation}

where $y_q \in R^{H \times W}$ and $F_{scale}$ is the channel-wise multiplication between the scalar $Se_q$ and the feature map $y_q$.

\end{itemize}
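The squeeze, excitation, and scaling steps can be sketched in plain Python as follows. The sketch follows the standard squeeze-and-excitation formulation of \citep{squeeze}, in which the channel weight applied in the scaling step is the output of the excitation step; this choice, the helper names, and the toy weights $W_1$ and $W_2$ are assumptions made for illustration and do not describe the trained network.

```python
import math


def squeeze(feature_maps):
    """Global average pooling: one scalar per channel (feature map)."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def excite(se, w1, w2):
    """Fully connected -> ReLU -> fully connected -> sigmoid."""
    hidden = [max(h, 0.0) for h in matvec(w1, se)]
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(w2, hidden)]


def scale(feature_maps, channel_weights):
    """Multiply every element of channel q by its scalar weight."""
    return [
        [[weight * value for value in row] for row in fmap]
        for fmap, weight in zip(feature_maps, channel_weights)
    ]


# Toy input: D = 2 channels of size 2 x 2, reduced to one hidden unit.
feature_maps = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
w1 = [[1.0, -1.0]]      # hypothetical reduction weights (1 x D)
w2 = [[2.0], [-2.0]]    # hypothetical expansion weights (D x 1)

se = squeeze(feature_maps)
t = excite(se, w1, w2)
rescaled = scale(feature_maps, t)
print(se, t)
```

Because each weight lies in $(0, 1)$, the scaling step attenuates every channel, and channels with a smaller weight are suppressed more strongly.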

\subsection{An enhanced multiple types of food segmentation method by adding Max RoI layer}

Because this research focuses on multiple types of food, the RPN is adopted to determine the position of each food item in the input image \citep{39}. The RPN accepts a feature map of any size as input, here the output of the proposed backbone, and generates several rectangular object proposals as output.

Each rectangular object proposal may contain an object. A sliding window is passed over the feature maps obtained from the last convolution layer of the proposed backbone. Each sliding-window position has nine anchors, which are centred on the window. Because the anchors differ in Aspect Ratio (AR) and Scale (S), the coordinates of each anchor are calculated relative to the input image, as shown in Figure \ref{figrpn}.

\begin{figure}[h]
\centering
\includegraphics[scale=.9]{rpn.png}
\caption{Sliding windows with different aspect ratios and scales.} \label{figrpn}
\end{figure}

As a result, the value of $p^{\ast}$ for each anchor is determined by two conditions:

\begin{enumerate}

\item The anchor has the highest Intersection-over-Union (IoU) overlap with a ground-truth box.

\item The anchor has an IoU overlap higher than 0.7 with a ground-truth box.

\end{enumerate}

The IoU is given by the following formula:

\begin{equation}
IoU = \frac{Anchor \cap \text{ground truth box}}{Anchor \cup \text{ground truth box}}
\end{equation}
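The IoU formula and the two anchor conditions can be sketched as follows. Boxes are assumed to be axis-aligned and given as $(x_1, y_1, x_2, y_2)$; the function names and the example boxes are hypothetical and serve only as an illustration.

```python
def box_iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, threshold=0.7):
    """Return p* per anchor: 1 if the anchor meets either condition, else 0."""
    labels = [0] * len(anchors)
    if not anchors or not gt_boxes:
        return labels
    ious = [[box_iou(a, g) for g in gt_boxes] for a in anchors]
    # Condition 2: IoU above the threshold with any ground-truth box.
    for index, row in enumerate(ious):
        if max(row) > threshold:
            labels[index] = 1
    # Condition 1: the anchor with the highest IoU for each ground-truth box.
    for g in range(len(gt_boxes)):
        best = max(range(len(anchors)), key=lambda a: ious[a][g])
        if ious[best][g] > 0.0:
            labels[best] = 1
    return labels


anchors = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12)]
gt_boxes = [(0, 0, 2, 2.5)]
labels = label_anchors(anchors, gt_boxes)
print(labels)
```

The threshold of 0.7 is taken from the second condition above; anchors that satisfy neither condition are labelled 0 in this sketch.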

Several rectangular object proposals are generated by the RPN on the feature maps, as shown in Figure \ref{fig2}. As a result, feature maps of different sizes are produced, which affects the instance segmentation accuracy.

\begin{figure}[H]
\centering
\includegraphics[scale=.8]{rpnn.png}
\caption{Several rectangular object proposals.}\label{fig2}
\end{figure}

This research proposes a layer, known as Max RoI, for handling feature maps of different sizes. It reduces each feature map to a fixed size in two stages.

Suppose that the size of the feature map is $5 \times 5$ and that the rectangular object proposal is marked in red, as represented in Figure \ref{FEATURE}.

\begin{figure}[H]
\centering
\includegraphics[height=8cm,width=.7\linewidth]{FEATURE.png}
\caption{The rectangular object proposals in red colour on the feature map.}
\label{FEATURE}
\end{figure}

The Max RoI layer reduces the feature maps to a fixed size in two stages.

The first stage preserves the position of the features by not applying quantisation to the RoI boundaries, in contrast to RoIPool \citep{16}, as represented in Figures \ref{r1b}, \ref{r1a}, \ref{r2b} and \ref{r2a}. RoIPool quantises the RoI boundaries and the bins to the pixel grid, and this coarse quantisation causes misalignment that lowers its instance segmentation performance \citep{16}. The proposed layer resolves the misalignment by removing the quantisation. In the second stage, max pooling is applied to each bin to reduce computational complexity and to extract low-level features from the neighbourhood, as represented in Figure \ref{r3b}.
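The two stages can be sketched in Python as follows. The text above does not state how feature values are read at non-integer positions, so the sketch assumes bilinear interpolation at a small grid of sample points per bin, in the manner of RoIAlign-style layers; this assumption, the function names, the output size, and the number of samples are illustrative only.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a real-valued position."""
    height, width = len(feature), len(feature[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def max_roi(feature, roi, out_size=2, samples=2):
    """Pool an RoI (y1, x1, y2, x2) to out_size x out_size without rounding.

    Stage 1: the RoI and its bins keep their real-valued boundaries.
    Stage 2: the maximum over the sampled values of each bin is kept.
    """
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    pooled = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            values = []
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (by + (sy + 0.5) / samples) * bin_h
                    x = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear(feature, y, x))
            row.append(max(values))
        pooled.append(row)
    return pooled


# Toy 5 x 5 feature map with value 5 * row + column, and a non-integer RoI.
feature = [[5.0 * r + c for c in range(5)] for r in range(5)]
pooled = max_roi(feature, roi=(0.6, 0.4, 3.8, 3.6))
print(pooled)
```

Because the RoI boundaries are never rounded, a shift of the proposal by a fraction of a pixel changes the pooled values smoothly instead of jumping between grid cells.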

\begin{figure}[H]
\centering
\subfloat[]{{\includegraphics[height=8cm,width=.7\linewidth]{RoI1.png}}\label{r1a}}
\qquad
\subfloat[]{{\includegraphics[height=8cm,width=.7\linewidth]{maxx.png}}\label{r1b}}
\caption{(a) The existing layer (RoIPool) after the application of quantisation. (b) The avoidance of quantisation by the proposed layer (Max RoI).}
\end{figure}

\begin{figure}[H]
\centering
\subfloat[]{\includegraphics[height=8cm,width=.7\linewidth]{roi2.png}\label{r2a}}
\qquad
\subfloat[]{\includegraphics[height=8cm,width=.7\linewidth]{max2.png}\label{r2b}}
\caption{(a) The existing layer (RoIPool) after the second application of quantisation. (b) The second avoidance of quantisation by the proposed layer (Max RoI).}
\end{figure}

\begin{figure}[H]
\centering
\subfloat[]{\includegraphics[height=8cm,width=.7\linewidth]{roi3.png}\label{r3a}}
\qquad
\subfloat[]{\includegraphics[height=8cm,width=.7\linewidth]{max3.png}\label{r3b}}
\caption{(a) The existing layer (RoIPool) result. (b) The Max RoI result.}
\end{figure}

%subsection{Fully Connected layer (FC)}

%This section presents the classification of the food items and their regression by reshaping the Max ROI results, which are transferred to the FC layer (refer to Figure 3.2). This phase consisted of two branches, namely food classification and the localisation of each food item.

%The enhancement of FCN cite{f34} is performed to distinguish between the levels of class, while the instance segmentation returned the boundaries of each food item. This phase consisted of three steps, which are as follows:

%begin{enumerate}

%item The first stage is the transfer of the output of Max ROI for food items through a series of 3 x 3 convolutional layers, which are re-applied three times after the application of ReLU to generate boundaries for each region obtained from Max ROI.

%item The second stage involved a 1 x 1 convolutional layer for each feature map obtained from the last convolution.

%item The third stage converted the segmentation size according to the input image through bilinear interpolation, in which different thresholds used in the research included (50, 60, 70, 80, 90).

%end {enumerate}

\section{Improved 3D construction algorithms based on a single RGB image for better food calories estimation}

This section describes the methods used to estimate the calories of the food in the image:

\begin{enumerate}

\item The improved depth estimation architecture, based on an encoder-decoder algorithm, handles depth estimation from a single RGB image, as shown in Figure \ref{part2}.

\item The enhanced reconstruction of the 3D shape combines the point cloud and finite element analysis algorithms. The initial and inferred depth images are registered and reconstructed into a complete global point cloud, which improves the estimation of the volume of regular and irregular food, as shown in Figure \ref{part2}.

\end{enumerate}

\begin{figure}[H]
\includegraphics[width=16cm,height=16cm]{part2.png}
\caption{Improved 3D construction algorithms based on a single RGB image for better food calories estimation.} \label{part2}
\end{figure}

\subsection{Improved depth estimation architecture based on an encoder-decoder algorithm to handle depth estimation from a single RGB image}

With the development of Artificial Intelligence (AI) algorithms, several deep learning algorithms have been proposed to estimate depth, as described in Chapter 2. The current research investigates the depth estimation of a single RGB image using an end-to-end learning architecture that produces a direct mapping from RGB to depth, as shown in Figure \ref{figencoder}.

\subsubsection{Improved encoder-decoder algorithm}

Figure \ref{figencoder} shows the proposed encoder-decoder algorithm for depth estimation from a single RGB image.

\begin{figure}[H]
\includegraphics[width=\textwidth,height=15cm]{s2.png}
\caption{Improved encoder-decoder algorithm for depth estimation from a single RGB image.} \label{figencoder}
\end{figure}

Many researchers argue that the performance of a CNN architecture increases with its depth. Nevertheless, stacking many layers cannot guarantee that the network improves and may instead lead to a significant performance decrease. This is caused by the vanishing gradient problem during the training phase \citep{d11}, which occurs when the CNN architecture is stacked with too many layers. DenseNet avoids the vanishing gradient through connections between the layers. %as shown in Figure \ref{denese}
However, DenseNet disregards the activation layer during the backpropagation process. No formula describes how the parameters of DenseNet change, which reduces the accuracy of the gradient formula. In addition, the formula used in DenseNet does not identify the layers that need more training than others.

%begin{figure}[h]

% centering

% includegraphics[width=linewidth] {s10.png}

%caption{DenseNet blocks citep{35}.} label{denese}

%end{figure}

The proposed architecture in this study improves DenseNet \citep{35} by simplifying and analysing the forward and backward propagation, as shown in Figure \ref{figgden}. New rules for the different parameters in DenseNet \citep{35} are obtained from the new gradient formula, which determines the layers that need more or less training. A filter size more suitable than that of DenseNet is also selected to extract high-level and low-level features from the input image and to reduce the number of parameters, which leads to a reduced computation time based on formula \ref{FORMULEDENESE}.

\begin{figure*}
\includegraphics[width=\textwidth,height=14cm]{den.png}
\caption{(a) DenseNet-169 architecture \citep{35}. (b) Improved DenseNet architecture.} \label{figgden}
\end{figure*}

begin{itemize}

\item Analysis of DenseNet\\

The connection between the layers through the gradient formula \citep{35} is the key to solving the vanishing gradient problem. However, directly inferring the forward and backward propagation of DenseNet through the gradient formula is challenging. Therefore, the forward and backward propagation of DenseNet within a network that addresses vanishing gradients are analysed.

begin{itemize}

\item Analysis of the forward propagation

The total loss function in DenseNet is calculated as half the sum of the squared differences between the predicted output and the ground truth, as shown in the following formula:

\begin{equation}
L=\frac{1}{2}\sum_{i=1}^{c} | \hat{y}_{i}-y_{i} |^{2}=\frac{1}{2}\sum_{i=1}^{c}| e_{i} |^{2}
\end{equation}

Where:\\
$y_i$ = The ground truth value.\\
$\hat{y}_{i}$ = The predicted value.\\
$e_{i}$ = $\hat{y}_{i}-y_{i}$.\\
\textit{c} = Number of classes.\\

The forward propagation of the first convolution layer in DenseNet is represented in the following formula.

\begin{equation}
s_0=\sum_{i=0}^n x_i \cdot w_i
\end{equation}

The dense block (DenseB) contains the function $h(\cdot)$, which consists of three operations: Batch Normalization (BN), a ReLU layer, and a convolution kernel. The function is applied to the set of preceding outputs [$s_0, s_1, \ldots, s_{i-1}$], so each layer receives the feature maps of all previous layers as input, as demonstrated in the following formulas.

\begin{equation}
DenseB = w_1 \cdot h_1(s_0) + w_2 \cdot h_2(s_0,s_1) + w_3 \cdot h_3(s_0,s_1,s_2) + w_4 \cdot h_4(s_0,s_1,s_2,s_3)
\end{equation}

\begin{equation}
DenseB_{i}=\sum_{j=1}^{R}\sum_{i=1}^{4} w_i h_i(s)
\end{equation}

Where \textit{R} is the number of times the dense block is repeated.\\

Then, the transition layers connect the different dimensions through the following formula.

\begin{equation}
y_{0}= \sum_{i=0}^{n} w_i \cdot \theta (DenseB_{i})
\end{equation}

Where $\theta(\cdot)$ is the activation function and $y_{0}$ is the output from the first transition layer.\\

The forward propagation for the encoder uses the following formula.

\begin{equation}
y_{j}=\sum_{j=0}^{3} \sum_{i=1}^{N} w_i \cdot \theta (DenseB_{i})
\end{equation}

Where \textit{j} indexes the dense blocks in the proposed architecture.

\item Analysis of the backpropagation

The predicted output of the simplified encoder is obtained from the weights of the last layer, to which backpropagation is applied. The gradient of \textit{L} is represented in the following formulas.

\begin{align}
\frac{\partial L}{\partial \omega_{B}} &= \frac{\partial L}{\partial \hat{y}_{B}}\cdot \frac{\partial \hat{y}_{B}}{\partial \omega_{B}} \notag \\
&= \delta_{B}\cdot \frac{\partial \left(\sum_{i=1}^{4}\sum_{j=1}^{N} \omega_{B} \cdot \theta (S_{Bi})\right)}{\partial \omega_{B}} \notag \\
&= \delta_{B} \cdot \theta (S_{Bi})
\end{align}

where $\delta_{B_4}$ is defined as

\begin{align}
\delta_{B_4} &= \frac{\partial L}{\partial S_{B_4}} \notag \\
&= \frac{\partial\, \frac{1}{2}\sum_{i=1}^{c}( \hat{y}_{i}-y_{i} )^{2}}{\partial S_{B_4}} \notag \\
&= e_{i}\cdot \theta'(S_{B_4})
\end{align}

The hidden layers of the encoder in this study share the same gradient formula.

\begin{equation}
\delta_B = \theta'(S_{B}) \cdot \sum \delta_{B+1}\cdot \omega_{B+1}
\end{equation}

The gradient of the first connection weights fades as the number of layers in the network increases.

This study defines $\Delta \delta_{B}^{n}$ as the gradient that increases with the number of layers (\textit{n}) in the encoder, as represented in the following formula.

\begin{equation} \label{FORMULEDENESE}
\Delta \delta_{B}^{n}=\theta'(S_{B}^{n})\sum_{B=1}^{c} \delta_{B}^{n+1},\quad n=1,\ldots,4
\end{equation}

The term $\Delta \delta_{B}^{n}$ addresses the vanishing gradient problem in a deep network.

end{itemize}
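As a brief illustration of why the gradient fades, the hidden-layer recursion $\delta_B = \theta'(S_B)\sum \delta_{B+1}\,\omega_{B+1}$ can be unrolled. The sketch below rests on simplifying assumptions that are not part of the original analysis: a single connection per layer, so that the sum reduces to one term, and a common bound $q$ on every factor $|\theta'(S)\,\omega|$.

\begin{align*}
\delta_{B} &= \theta'(S_{B})\,\omega_{B+1}\,\delta_{B+1} \\
&= \left(\prod_{k=0}^{n-1} \theta'(S_{B+k})\,\omega_{B+k+1}\right)\delta_{B+n}, \\
|\delta_{B}| &\leq q^{\,n}\,|\delta_{B+n}|
\end{align*}

For instance, with $q = 0.5$ and $n = 4$, the gradient reaching the first layer is scaled by at most $0.5^{4} = 0.0625$, which is the fading behaviour that $\Delta\delta_{B}^{n}$ is intended to counteract.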

The skip connection technique connects the encoder and decoder and transfers the feature maps to the decoder during the up-sampling process for depth estimation, which speeds up the learning of context awareness and overcomes translation invariance.

The decoder in this study uses bilinear interpolation for up-sampling, as shown in Figure \ref{figencoder}, where each up-sampling block uses ReLU as the activation of its convolution layers.

\item Loss function: In this research, the encoder-decoder algorithm adopts various loss functions, as represented in the following formulas.

\begin{equation}
L_{\text{mean absolute}}(L_1)=\frac{1}{N}\sum_{i=1}^{N}| \hat{y}_{i}-y_{i}|
\end{equation}

\begin{equation}
L_{\text{mean square}}(L_2)=\frac{1}{N}\sum_{i=1}^{N} (\hat{y}_{i}-y_{i})^{2}
\end{equation}

\begin{equation}
L_{huber} = \left\{\begin{array}{lr}
L_1(l_i) & L_1(l_i) \geq c,\\
\frac{L_2(l_i) + c^2}{2c} & \text{otherwise}
\end{array}\right.
\end{equation}

\begin{equation}
L_{berhu} = \left\{\begin{array}{lr}
L_1(l_i) & L_1(l_i) \leq c,\\
\frac{L_2(l_i) + c^2}{2c} & \text{otherwise}
\end{array}\right.
\end{equation}

Where:\\
$y_i$ = The ground truth value.\\
$\hat{y}_{i}$ = The predicted value.\\
\textit{s} = Number of classification layers.\\
\textit{N} = The normalization term.\\
\textit{c} = $\frac{1}{5}\max_i(|\hat{y}_{i}-y_{i}|)$.\\
\textit{i} = Index of a pixel in each depth image of the current batch.

end{itemize}
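The loss functions above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this research: the function names are hypothetical, depth maps are assumed to be flat lists of per-pixel values, and the reverse Huber (berHu) loss is assumed to be averaged over the pixels. Only $L_1$, $L_2$, and berHu are shown, with the threshold $c$ taken as one fifth of the largest absolute error.

```python
def l1_loss(pred, target):
    """Mean absolute error over all pixels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error over all pixels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def berhu_loss(pred, target):
    """Reverse Huber loss: |e| if |e| <= c, else (e^2 + c^2) / (2c)."""
    errors = [abs(p - t) for p, t in zip(pred, target)]
    c = max(errors) / 5.0  # c = 1/5 of the largest absolute error
    if c == 0.0:
        return 0.0  # prediction equals ground truth everywhere
    total = 0.0
    for e in errors:
        total += e if e <= c else (e * e + c * c) / (2.0 * c)
    return total / len(errors)  # averaging is an assumption of this sketch


# Hypothetical depth values, for illustration only
pred = [1.0, 2.0, 4.0]
target = [1.0, 1.0, 2.0]
print(l1_loss(pred, target), l2_loss(pred, target), berhu_loss(pred, target))
```

Below the threshold the berHu loss behaves like $L_1$, and above it the quadratic branch penalises large depth errors more strongly.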

%subsection {GrabCut}

%After estimating the depth map of the input image, each item of the food image is cut through the adoption of the GrabCut algorithm cite{f33}, as shown in Figure 3.13.

%The algorithm is based on solving the problem of dividing each part of the foreground object’s input image presented in a difficult and complex environment since it exists in the data set used in this research, making it difficult to subtract the background of the input images. Therefore, the algorithm is adapted to deal with this difficult environment with minimal interaction with the user. The GrabCut algorithm gives each image probability for it to be divided into a cluster based on the probability that is given per image as given in the formula below:

% begin{equation}

% S = (s_1 ……….,s_N ) of N image

%end{equation}

%Where:

%Si = (C$_{1i},C_{2i},C_{3i}), i subset [ 1,……….N).$\

%Where C$_j $= cluster of the colour component in the used colour space.

%The GrabCut algorithm is defined as an array given in the formula below, where the input image is assigned a label for each image.

%begin{equation}

% a = ( a_1,…….,a_n), a_i subset {0,1}

%end{equation}

%Where $a_i $= value of the image.

%begin{figure}[hbt!]centering

%includegraphics[width=textwidth,height=4cm]{cut.png}

%caption{Depth map of the input image after GrabCut} label{fig:pythagoras}

%end{figure}

\subsection{Enhanced reconstruction of the 3D shape and the regular and irregular volume of food by combining point cloud and finite element analysis algorithms}

Based on the depth map, 3D point clouds are generated for one view using camera coordinates centred on the RGB-Depth (RGB-D) dataset \citep{f35}, for the original and back-to-depth images, to obtain \textit{XYZ} coordinates. First, the \textit{XYZ} coordinates are obtained by back-projecting the image pixels with the formula below:

\begin{equation}
\begin{bmatrix}
X \\ Y \\ Z
\end{bmatrix}
= ZK^{-1}
\begin{bmatrix}
u \\ v \\ 1
\end{bmatrix}
\quad \text{and} \quad
K=
\begin{bmatrix}
f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1
\end{bmatrix}
\end{equation}

Where \textit{u, v} correspond to the image coordinates and \textit{X, Y}, and \textit{Z} correspond to the world coordinates. \textit{Z} is a scalar corresponding to the depth map value at \textit{(u, v)}, and $K \in \mathbb{R}^{3\times 3}$ corresponds to the intrinsic camera matrix. Second, the opposite camera orientation can be conveniently determined by rotating the camera by 180 degrees and converting this rotation into a rotation matrix. Since the origin is shifted to the centre of the original camera, the angle of the rotation matrix about the \textit{y-axis} can be set to 180 degrees, and the matrix can be written as in the formula below:

\begin{equation}
R_y(\theta)=
\begin{bmatrix}
\cos(\theta) & 0 & \sin(\theta)\\
0 & 1 & 0\\
-\sin(\theta) & 0 & \cos(\theta)
\end{bmatrix},
\quad
R_y(180^{\circ})=
\begin{bmatrix}
-1 & 0 & 0\\
0 & 1 & 0\\
0 & 0 & -1
\end{bmatrix}
\end{equation}

Where $\theta$ is the camera rotation angle around the \textit{y-axis}.

The translation matrix describes the translation between the original and opposite camera locations. The proposed encoder-decoder, as stated in the previous section, receives these extrinsic parameters. After the rotation and translation matrices are obtained, the virtual point cloud is registered in the same world coordinates by applying the formula below:

\begin{equation}
\begin{bmatrix}
X \\ Y \\ Z
\end{bmatrix}
= R^{-1} \left(ZK^{-1}
\begin{bmatrix}
u \\ v \\ 1
\end{bmatrix}
-T\right)
\end{equation}

Where \textit{R} $\in \mathbb{R}^{3\times 3}$ refers to the rotation matrix, and \textit{T} $\in \mathbb{R}^{3\times 1}$ represents the translation matrix.
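The two projection formulas can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the implementation of this research: the function names and the intrinsic values ($f_x = f_y = 500$, $c_x = 320$, $c_y = 240$) are hypothetical, and $K$ is assumed to have the pinhole form given above, so that $K^{-1}[u, v, 1]^{T}$ can be written in closed form.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with depth Z to the camera-frame point Z * K^{-1} [u, v, 1]^T."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(point, rotation, translation):
    """Apply R^{-1} (p - T); for a rotation matrix, R^{-1} is the transpose of R."""
    d = [p - t for p, t in zip(point, translation)]
    # Row i of the transpose is column i of R
    return tuple(sum(rotation[k][i] * d[k] for k in range(3)) for i in range(3))


# Rotation of 180 degrees about the y-axis
R_Y_180 = [[-1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, -1.0]]

# Hypothetical pixel, depth, and camera parameters, for illustration only
p_cam = back_project(420, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = to_world(p_cam, R_Y_180, translation=(0.0, 0.0, 4.0))
print(p_cam, p_world)
```

Applying \texttt{back\_project} to every pixel of the depth map yields the point cloud of one view, and \texttt{to\_world} places the points of the opposite view in the same coordinate frame.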

%subsection {Meshing }

After acquiring the point cloud, the next step is to estimate the 3D models of the food. This research follows the convex hull technique \citep{f37} to mesh the food into 3D models, since the convex hull places no constraint on the form of the objects. First, a sphere with a fixed radius is defined, and a starting point is chosen from the contour of the object. The sphere is then rotated about this starting point around the food until it reaches another point on the outline. The sphere is then moved to this new point, and the procedure repeats until the loop closes.

%subsection{Enhanced the irregular and regular volume of food by conducting finite element analysis algorithm}

After the 3D models are estimated, the food volumes are calculated by measuring geometric properties and considering previously known food item models, which are key in calculating the food volumes. However, if a food model is irregular, this research divides the food into several small portions using finite element analysis \citep{f38} so that both the regular and irregular volumes can be found. Here, 3D food items are divided into a finite number of arbitrarily shaped parts. For each food item, this research calculates the coordinates of every point of the item and estimates the mass point as the average of all coordinate points.

%begin{figure}[H]

% centering

% subfloat []{includegraphics[height=8cm,width=.49linewidth]{vol1.png} }

% hfill

% subfloat []{{includegraphics[height=8cm,width=.49linewidth]{vol2.png}}}

% caption{(a). Single tetrahedron with coordinate points a, b, c, and d (b). All tetrahedron on food image.}label{v}

%end{figure}

Finally, each 3D point is connected to the mass point to form tetrahedra, as shown in Figure \ref{figv}.

\begin{figure}[H]\centering
\includegraphics[width=\textwidth,height=12cm]{volume.png}
\caption{Volume measurement using tetrahedra.} \label{figv}
\end{figure}

The volume of the food item is obtained by summing the volumes of all the tetrahedra, where the volume of a single tetrahedron is given by the formula below \citep{f38}:

\begin{equation}
v=\frac{(a-b)\cdot((b-d)\times(c-d))}{6}
\end{equation}

Where \textit{a, b, c}, and \textit{d} are the coordinate vectors of the vertices of the tetrahedron.
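The tetrahedron summation can be sketched in Python as follows. This is an illustrative sketch under assumptions that are not stated in the original text: the mesh is given as a list of vertices and a list of surface triangles (vertex index triples), the mass point is the mean of the vertices, the absolute value of each signed volume is taken, and the shape is star-shaped with respect to the mass point so that the tetrahedra do not overlap. All function names are hypothetical.

```python
def sub(p, q):
    """Component-wise difference p - q of two 3D points."""
    return tuple(pi - qi for pi, qi in zip(p, q))


def cross(p, q):
    """Cross product of two 3D vectors."""
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def dot(p, q):
    """Dot product of two 3D vectors."""
    return sum(pi * qi for pi, qi in zip(p, q))


def tetrahedron_volume(a, b, c, d):
    """Volume |(a - b) . ((b - d) x (c - d))| / 6 of the tetrahedron a, b, c, d."""
    return abs(dot(sub(a, b), cross(sub(b, d), sub(c, d)))) / 6.0


def mesh_volume(vertices, triangles):
    """Sum the tetrahedra spanned by each surface triangle and the mass point."""
    n = len(vertices)
    mass_point = tuple(sum(v[k] for v in vertices) / n for k in range(3))
    return sum(
        tetrahedron_volume(vertices[i], vertices[j], vertices[k], mass_point)
        for i, j, k in triangles
    )


# A single tetrahedron with three unit edges along the axes
print(tetrahedron_volume((1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)))
```

For a closed mesh, every surface triangle contributes one tetrahedron whose apex is the mass point, which mirrors the decomposition shown in the figure.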

\section{Evaluation of the proposed algorithms}

\subsection{The dataset}

This section describes the datasets used for multiple food instance segmentation and calorie estimation. The first dataset is created for multiple food instance segmentation. The second dataset is the Common Objects in Context (COCO) benchmark dataset \citep{e3}, which is applied to demonstrate the detection effectiveness of the proposed algorithm. The third dataset is the NYU Depth v2 benchmark \citep{d39}, one of the most well-known datasets for single RGB image depth estimation. Finally, a dataset for the estimation of calories in food images is proposed to calculate the calories of each type of food in each image.

begin{enumerate}

\item Multiple food instance segmentation dataset: Instance segmentation tasks based on deep learning require large datasets and substantial computing power. The computing power issue is solved through the use of a GPU. However, few well-annotated open-source datasets exist for instance segmentation, especially for multi-food instance segmentation. Therefore, a dataset containing multiple types of food is created to achieve instance segmentation of multiple types of food. %, as shown in Figure \ref{DATASET}
In addition, the food-containing images of the COCO dataset \citep{e3} are used in the training process.

In creating the dataset, the factors that affect the accuracy of the results are taken into account. Therefore, the images of multiple types of food are captured at high quality using an iPhone 5 camera (8-megapixel iSight camera, panorama, autofocus, and LED flash). The captured images are stored at different resolutions so that the image processing algorithm adopted in this research can handle differences in resolution.

Then, each image in the dataset is annotated using the Visual Geometry Group (VGG) Image Annotator (VIA) \citep{f32} by drawing a polygon around each food item.

The annotation file of the dataset contains food information such as width, length, and other attributes. Finally, the annotation file is converted into the COCO dataset format through a Python program.

The food items in the dataset are not randomly selected. The selected images contain many challenges: apples, tomatoes, and carrots come in different sizes; lemons and bananas have similar colour and shape features; grapes and apples come in different colours; apples, oranges, and lemons share the same shape; and kiwis and potatoes have the same porosity. Therefore, the detection and instance segmentation of multiple types of food is a harder task.

The multiple types of food dataset consists of 27 classes (Banana, Onion, Grape, Pear, Rice, Lemon, Pringles, Tomato, Potato, Cucumber, Roasted Chicken Breast, Apple, Bread, Carrot, Egg, Orange, Cantaloupe, Peach, Plum, Kiwi, Cake, Hotdog, Pizza, Donut, Fig, Broccoli, Sandwich). In total, the dataset consists of 27000 images of multiple types of food.

\begin{figure}[H]
\centering
\includegraphics[width=9cm,height=16cm]{imagd.png}
\caption{Example images from the multiple types of food instance segmentation dataset.}
\label{DATASET}
\end{figure}

\begin{enumerate}

\item Multiple food image enhancement through image processing: The food image size is changed to 1028 $\times$ 1028 pixels to retain the most information from the food image. Images of any size are accepted in this study through two options, namely:
\begin{enumerate}
\item If the image is smaller than the required size, padding techniques are applied.
\item If the image is larger than the required size, Lanczos re-sampling \citep{20} is applied to down-sample the input image while preserving the original image features.
\end{enumerate}

Notably, given that time consumption is an important aspect of this study, JPEG compression is applied to reduce the image size (from 4.38 MB and 2.49 MB to 0.196 MB and 0.0385 MB, respectively). In this way, the dataset is reduced from 2.63 GB to 116 MB, which decreases the time consumed in subsequent processing.

\end{enumerate}

\item Multiple object instance segmentation dataset: The experiments are conducted on the MS-COCO dataset \citep{e3}, as shown in Figure \ref{coco}. The COCO dataset is one of the most popular open-source object instance segmentation datasets used to train deep learning models.

The COCO dataset includes 118k images for training, 5k for validation (val), and 20k for annotated testing (test-dev). The COCO average precision (AP) is calculated by averaging over IoU thresholds from 0.5 to 0.95 with an interval of 0.05. All models are trained on the COCO training set and tested on the val set. For a fair comparison, the final results are compared with state-of-the-art instance segmentation algorithms on the test-dev set.

\begin{figure}[H]
\centering
\includegraphics[width=15cm,height=9cm]{coco.png}
\caption{Example images from the MS-COCO dataset \citep{e3}.}
\label{coco}
\end{figure}

\item Encoder-decoder algorithm for depth estimation of a single RGB image dataset: The quality of the depth estimation is evaluated using the NYU Depth v2 benchmark \citep{d39}, which is one of the most well-known datasets for single RGB image depth estimation, as shown in Figure \ref{Sus}.\\ This dataset contains 1449 densely labelled pairs of RGB and depth images from indoor scenes, 464 new scenes, and 407,024 new unlabelled frames captured using Microsoft Kinect. Following previous works that employed the NYU Depth v2 benchmark to examine depth estimation \citep{d8, d21,d22}, the standard training and testing split is used, and the evaluation is performed on the 654 image-depth pairs of the test set.

\begin{figure}[H]
\centering
\includegraphics[width=10cm,height=18cm]{sus.png}
\caption{Example images from the NYU Depth v2 benchmark \citep{d39}.}
\label{Sus}
\end{figure}

\begin{enumerate}

\item Depth map inpainting: After the algorithm for determining the depth map is proposed, as explained previously, it is trained on the RGB-D Object dataset \citep{f35}, which contains many food items such as apples, oranges, and bananas, to obtain the best results in the depth map prediction process. The RGB-D Object dataset consists of 300 common household objects organised into 51 categories. The dataset is recorded using a Kinect-style 3D camera that records synchronised and aligned 640 $\times$ 480 RGB and depth images at 30 Hz. Each object is placed on a turntable, and video sequences are captured for one complete rotation. There are three video sequences for each object, each recorded with the camera mounted at a different height so that the object can be viewed from different angles with the horizon. The dataset also provides ground-truth pose information for all 300 objects. This helps to find the best estimation of calories based on a single RGB image.

However, the dataset is deficient, given that not all pixels in the depth images have known values. This research therefore needs to infer the missing pixel values, a computer vision problem called image inpainting. The current study adopts the Navier-Stokes (NS) method \citep{f36}, which fills in the missing pixels by propagating the image intensity and the isophote lines into the inpainting region according to the boundary conditions. The method matches the image intensity and the direction of the isophote lines across the borders of the region, which avoids discontinuities in the slope of the isophote lines and leads to a continuous image, as shown in Figure \ref{figma}.

\begin{figure}[H]\centering
\includegraphics[width=\textwidth,height=16cm]{na.png}
\caption{NS method result on the RGB-D dataset.} \label{figma}
\end{figure}

\end{enumerate}

\item Estimation of calories in food images dataset

%subsubsection{Calorie volume measurement}

After the volume of the food is known, the calories are calculated from the calorie density based on nutritional facts \citep{f39} and the mass density \citep{f39}, as shown in Table \ref{cal}. Prior knowledge of the food type is obtained from the classification, as explained above. The calories of the food in the image are estimated according to the formula below:

\begin{equation}
Cal = V \times P \times C
\end{equation}

Where:\\
\textit{Cal} = Calorie estimation.\\
\textit{V} = Volume.\\
\textit{C} = Calorie density.\\
\textit{P} = Mass density.

begin{longtable}{|p{4cm}| p{3cm} |p{4cm} |}

\caption{Calorie density chart.}\label{cal}\\
\hline
\textbf{Food type} & \textbf{Calorie density (kcal/g)} & \textbf{Mass density (g/cm$^3$)} \\
\hline
\endfirsthead
\multicolumn{3}{c}%
{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
\hline
\textbf{Food type} & \textbf{Calorie density (kcal/g)} & \textbf{Mass density (g/cm$^3$)} \\
\hline
\endhead
\hline \multicolumn{3}{r}{\textit{Continued on next page}} \\
\endfoot
\hline
\endlastfoot

Apple raw& 0.52& 0.78 \

hline

Tomato raw & 0.18& 0.47 \

hline

Rice &1.11& 0.45\

hline

Egg boiled& 1.43& 1.03 \

hline

Lemon raw & 0.29& 0.96 \

hline

Potato boiled &0.77& 0.63\

hline

Onions raw& 0.4& 0.95\

hline

Carrot raw& 0.28& 0.46 \

hline

Kiwi raw & 0.61& 0.97 \

hline

Fig raw &0.74& 1.09\

hline

Cantaloupe raw& 0.34& 1.08\

hline

Pringles chips& 5.58& 0.12\

hline

Plum raw& 0.46&0.74\

hline

Orange raw& 0.47 & 0.9\

hline

Cucumber raw&0.1& 0.56\

hline

end{longtable}

end{enumerate}
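The calorie formula can be illustrated with a short Python sketch. It is a minimal example, not the implementation of this research: the function and table names are hypothetical, the food volume of 200 cm$^3$ is an invented illustrative value, and the calorie density column of the chart is assumed to be expressed in kcal per gram. The apple entry is copied from the calorie density chart.

```python
# food type -> (calorie density in kcal/g, mass density in g/cm^3)
# The apple row is taken from the calorie density chart; kcal/g is an assumed unit.
FOOD_TABLE = {
    "Apple raw": (0.52, 0.78),
}


def estimate_calories(volume_cm3, food_type, table=FOOD_TABLE):
    """Cal = V x P x C for a recognised food type and an estimated volume."""
    calorie_density, mass_density = table[food_type]
    mass_g = volume_cm3 * mass_density  # mass from volume and mass density
    return mass_g * calorie_density     # calories from mass and calorie density


# Hypothetical apple volume of 200 cm^3, for illustration only
print(estimate_calories(200.0, "Apple raw"))
```

With these values the mass is $200 \times 0.78 = 156$ g, and the estimate is $156 \times 0.52 = 81.12$ kcal. Further rows of the chart can be added to the table in the same way.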

%section{Performance measurement of enhanced instance segmentation algorithm and improved 3D construction algorithm }

\subsection{Performance measurement of enhanced instance segmentation algorithm}

The performance of the enhanced instance segmentation algorithm is measured using the following two measurements:

\begin{enumerate}
\item Average precision (AP) over different Intersection over Union (IoU) thresholds, which is used to compare the multiple food instance segmentation algorithms, including the enhanced CNN and existing instance segmentation algorithms \citep{42,43,44}.
\item Average Graphics Processing Unit (GPU) time for training and testing \citep{42,43,44}.
\end{enumerate}
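The IoU matching that underlies the AP measurement can be sketched in Python. This is only a partial illustration under stated assumptions: masks are represented as equally sized 2D lists of 0 and 1, the function names are hypothetical, and the sketch covers the IoU computation and the ten thresholds from 0.5 to 0.95. A full AP evaluation additionally ranks the detections by confidence and integrates the precision-recall curve, which is omitted here.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over Union of two binary masks (2D lists of 0/1)."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]


def matches_per_threshold(iou):
    """For each threshold, report whether a prediction with this IoU counts as a match."""
    return {t: iou >= t for t in IOU_THRESHOLDS}


# Hypothetical 2 x 2 masks, for illustration only
predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
print(mask_iou(predicted, ground_truth))
```

A predicted mask is counted as a true positive at a given threshold only when its IoU with the ground-truth mask reaches that threshold, so stricter thresholds reward more precise segmentation.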

\subsection{Performance measurement of improved 3D construction algorithms}

The performance measurement of the improved 3D construction algorithm consists of two separate steps: the performance measurement of the encoder-decoder algorithm for depth estimation of a single RGB image, and the performance measurement of the enhanced calorie estimation in food images.

\subsubsection{Performance measurement of improved encoder-decoder algorithm for depth estimation of a single RGB image}

The performance of the encoder-decoder algorithm is measured using the following four measurements \citep{d8, d21,d22,d38}:

\begin{enumerate}

\item Average relative error (rel):
\begin{equation}
\frac{1}{n}\sum_{i=1}^{n} \frac{| y_{i}- \hat{y}_{i}|}{y_{i}}
\end{equation}

\item Root mean squared error (rms):
\begin{equation}
\sqrt{ \frac{1}{n}\sum_{i=1}^{n} (y_{i}- \hat{y}_{i})^2}
\end{equation}

\item Average $\log_{10}$ error:
\begin{equation}
\frac{1}{n}\sum_{i=1}^{n}|\log_{10} y_{i}- \log_{10}\hat{y}_{i}|
\end{equation}

\item Threshold accuracy, the percentage of pixels for which:
\begin{equation}
\max\left(\frac{y_i}{\hat{y}_i}, \frac{\hat{y}_i}{y_i}\right)=\delta< thr, \quad thr = 1.25,\ 1.25^2,\ 1.25^3
\end{equation}

Where\\
$y_i$ = The ground truth value.\\
$\hat{y}_{i}$ = The predicted value.\\
\textit{n} = Total number of pixels in each depth image.

\end{enumerate}
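The four depth measurements can be computed together, as in the Python sketch below. It is a minimal illustration with a hypothetical function name, assuming that the predicted and ground-truth depth maps are flat lists of strictly positive values of equal length.

```python
import math


def depth_metrics(pred, target):
    """Return rel, rms, log10 error, and the three threshold accuracies."""
    n = len(target)
    pairs = list(zip(pred, target))
    rel = sum(abs(t - p) / t for p, t in pairs) / n
    rms = math.sqrt(sum((t - p) ** 2 for p, t in pairs) / n)
    log10 = sum(abs(math.log10(t) - math.log10(p)) for p, t in pairs) / n
    ratios = [max(t / p, p / t) for p, t in pairs]
    # Fraction of pixels whose ratio is below 1.25, 1.25^2, and 1.25^3
    deltas = tuple(sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3))
    return {"rel": rel, "rms": rms, "log10": log10, "delta": deltas}


# Hypothetical depth values in metres, for illustration only
print(depth_metrics(pred=[2.0, 2.0, 3.0], target=[2.0, 4.0, 2.0]))
```

Lower values are better for the three error measurements, whereas higher values are better for the threshold accuracies.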

\subsubsection{Performance measurement of enhanced estimation of calories in food images}

The performance of the enhanced estimation of calories in food images is measured using the following two measurements:

\begin{enumerate}

\item The absolute error is the difference between the measured value and the true value. This measure is calculated using formula \ref{aa}:
\begin{equation}\label{aa}
A= |X-Y|
\end{equation}

Where:\\
\textit{A} = The absolute error.\\
\textit{X} = The measured value.\\
\textit{Y} = The true value.\\

\item To account for significance and units, the absolute error is compared relative to the true value. Thus, this research defines the relative error as the ratio between the absolute error and the absolute value of the true value, as given by formula \ref{a1}:
\begin{equation}\label{a1}
R=\frac{A}{|Y|}
\end{equation}

Where:\\
\textit{R} = The relative error.\\
\textit{A} = The absolute error.\\
\textit{Y} = The true value.\\

\end{enumerate}
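As a worked illustration of the two measurements, consider hypothetical values that are not taken from the experiments: an estimated calorie value of $X = 81.1$ kcal and a true value of $Y = 95.0$ kcal.

\begin{equation*}
A = |81.1 - 95.0| = 13.9 \text{ kcal}, \qquad R = \frac{13.9}{|95.0|} \approx 0.146
\end{equation*}

The absolute error keeps the unit of the measurement, whereas the relative error is unitless and can be read as an error of approximately 14.6\%.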

\section{Summary of the chapter}

This chapter presents the proposed methodology of this study for constructing new deep learning algorithms for the enhanced instance segmentation of multiple types of food and improved 3D construction algorithms based on a single RGB image for better food calorie estimation. A dataset containing food items is used as a benchmark to calculate the calories in multiple foods. The research results are discussed in the next chapter, which also compares the proposed methodology with other algorithms to verify the success and effectiveness of the proposed algorithms.