1 Introduction
In this paper, we propose a method for image-set classification based on convex cone models, which can exactly represent the geometrical structure of an image set. In particular, we discuss the effectiveness of combining the proposed method with convolutional neural network (CNN) features extracted from a high-level hidden layer of a trained CNN.
For the last decade, image set-based classification methods have gained substantial attention in various applications using multi-view images or videos, such as 3D object recognition and motion analysis. The essence of image set-based classification lies in how to measure the similarity between two image sets effectively and at low cost. To this end, several types of methods using different models have been proposed (Fukui and Yamaguchi, 2005; Sakano and Mukawa, 2000; Fukui and Yamaguchi, 2007; Fukui et al., 2006; Fukui and Maki, 2015; Kim et al., 2007; Wang et al., 2008; Cevikalp and Triggs, 2010; Lu et al., 2017, 2015; Hayat et al., 2015; Feng et al., 2016; Shah et al., 2017; Yamaguchi et al., 1998).
In this paper, among the above methods, we focus on subspace-based methods, considering the compactness of a subspace model, the simple geometrical relationship of class subspaces, and their practical and efficient computation. In this type of method, a set of images is compactly modeled by a subspace in a high-dimensional vector space, where the subspace is generated by applying principal component analysis (PCA) to the image set without data centering. After converting each image set to a subspace, the similarity between two subspaces to be compared can be calculated by using the canonical angles between the subspaces (Afriat, 1957; Hotelling, 1936). Typical subspace-based methods are the mutual subspace method (MSM) (Yamaguchi et al., 1998) and its extension, the constrained mutual subspace method (CMSM) (Fukui and Yamaguchi, 2005).
Besides the above advantages, the validity of the subspace representation is also supported by the following physical characteristic: images of a convex object with Lambertian reflectance under various illumination conditions can be represented by a low-dimensional subspace, called an illumination subspace (Georghiades et al., 2001; Belhumeur and Kriegman, 1998; Lee et al., 2005). In other words, in object recognition, the subspace of an object can be stably generated from even a few sample images under different illumination conditions. Our convex cone representation is an enhanced extension of the subspace representation.
Conventional subspace-based methods take a raw intensity vector or a hand-crafted feature as the input. As for more discriminative features, many recent studies have revealed that CNN features are effective inputs for various types of classifiers (Sharif Razavian et al., 2014; Chen et al., 2016; Guanbin Li and Yu, 2015; Azizpour et al., 2016). Inspired by the successes in these studies, we expect that CNN features can also work as discriminative inputs for subspace-based methods, such as MSM and CMSM. In this paper, we verify the effectiveness of CNN features for subspace-based methods as the baseline. To the best of our knowledge, this paper is the first comprehensive report on the validity of the combination of MSM/CMSM and CNN features.
CNN feature vectors have only nonnegative values when the rectified linear unit (ReLU) (Nair and Hinton, 2010) is used as an activation function. Although there are many types of features with a nonnegative constraint, in this paper we focus on CNN features. This characteristic does not allow CNN features to be combined with negative coefficients; accordingly, a set of CNN features forms a convex cone instead of a subspace in a high-dimensional vector space.
For example, it is well known that a set of front-facing images under various illumination conditions forms a convex cone, referred to as an illumination cone (Georghiades et al., 2001; Belhumeur and Kriegman, 1998; Lee et al., 2005). The illumination cone is a stricter representation than the illumination subspace mentioned above. Several previous studies have demonstrated the advantages of convex cone representation compared with subspace representation (Kobayashi and Otsu, 2008; Kobayashi et al., 2010; Wang et al., 2017, 2018). These advantages naturally motivated us to replace a subspace with a convex cone in models for a set of CNN features and, more generally, for features with a nonnegative constraint.
In this framework, it is necessary to consider how to calculate the geometric similarity between two convex cones. To this end, we define multiple angles between two convex cones by following the definition of the canonical angles (Hotelling, 1936; Afriat, 1957) between two subspaces. Although the canonical angles between two subspaces can be analytically obtained from the orthonormal basis vectors of the two subspaces, the definition of angles between two convex cones is not trivial, as we need to consider the nonnegative constraint. In this paper, we define multiple angles between convex cones sequentially from the smallest to the largest by repeatedly applying the alternating least squares method (Tenenhaus, 1988). Then, the geometric similarity between two convex cones is defined based on the obtained angles. We call the classification method using this similarity index the mutual convex cone method (MCM), corresponding to the mutual subspace method (MSM).
Moreover, to enhance the performance of the MCM, we introduce a discriminant space, which maximizes the between-class variance (gap) among convex cones projected onto the discriminant space and minimizes the within-class variance of the projected convex cones, similar to the Fisher discriminant analysis (Fisher, 1936). The class separability can be increased by projecting the class convex cones onto the discriminant space, as shown in Fig.1. As a result, the classification ability of MCM is enhanced, similar to the effect of projecting class subspaces onto a generalized difference subspace (GDS) in CMSM (Fukui and Maki, 2015). Finally, we perform the classification using the angles between the projected convex cones. We call this enhanced method the "constrained mutual convex cone method (CMCM)," corresponding to the constrained MSM (CMSM). This idea was motivated by our preliminary work (Sogi et al., 2018); this paper provides a deeper analysis with extensive and comprehensive experiments.
The main contributions of this paper are summarized as follows.

We verify the validity of the combination of MSM/CMSM and CNN features, which has not yet been reported in the research fields of computer vision and machine learning.

To enhance the framework of the subspace based methods, we introduce a convex cone representation to accurately and compactly represent a set of features with nonnegative constraint as typified by CNN features.

We introduce two novel mechanisms in our image set based classification: a) multiple angles between two convex cones to measure the similarity between the cones; and b) a discriminant space to increase the class separability among convex cones.

We propose two novel image set based classification methods, called MCM and CMCM, based on convex cone representation and the discriminant space.
The paper is organized as follows. In Section 2, we describe the algorithms of conventional methods, such as MSM and CMSM. In Section 3, we describe the details of the proposed method. In Section 4, we demonstrate the validity of the proposed method by visualization and classification experiments using four public datasets, i.e., CMU PIE (Gross et al., 2010), ETH80 (Leibe and Schiele, 2003), CMU Motion of Body (Gross and Shi, 2001), and YouTube Celebrities (Kim et al., 2008), and a private database of multi-view hand shapes. Section 5 concludes the paper.
2 Related work
In this section, we first describe the algorithms for the MSM and CMSM, which are standard methods for image set classification. Then, we provide an overview of the concept of convex cones.
2.1 Mutual subspace method based on canonical angles
MSM is a classifier based on canonical angles between two subspaces, where each subspace represents an image set.
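Given an $N_1$-dimensional subspace $\mathcal{S}_1$ and an $N_2$-dimensional subspace $\mathcal{S}_2$ in a $d$-dimensional vector space, where $N_1 \le N_2$, the canonical angles $\{0 \le \theta_1, \dots, \theta_{N_1} \le \frac{\pi}{2}\}$ between $\mathcal{S}_1$ and $\mathcal{S}_2$ are recursively defined as follows (Hotelling, 1936; Afriat, 1957):
(1)  $\cos\theta_i = \max_{u \in \mathcal{S}_1} \max_{v \in \mathcal{S}_2} u^T v = u_i^T v_i$, s.t. $\|u_i\|_2 = \|v_i\|_2 = 1$, $u_i^T u_j = v_i^T v_j = 0$, $j = 1, \dots, i-1$,
where $u_i$ and $v_i$ are the canonical vectors forming the $i$th smallest canonical angle $\theta_i$ between $\mathcal{S}_1$ and $\mathcal{S}_2$. The $j$th canonical angle $\theta_j$ is the smallest angle in the direction orthogonal to the canonical angles $\{\theta_k\}_{k=1}^{j-1}$, as shown in Fig.2.
The canonical angles can be calculated from the orthogonal projection matrices onto subspaces $\mathcal{S}_1$ and $\mathcal{S}_2$. Let $\{\phi_i\}_{i=1}^{N_1}$ be orthonormal basis vectors of $\mathcal{S}_1$ and $\{\psi_i\}_{i=1}^{N_2}$ be orthonormal basis vectors of $\mathcal{S}_2$. The projection matrices $P_1$ and $P_2$ are calculated as $\sum_{i=1}^{N_1} \phi_i \phi_i^T$ and $\sum_{i=1}^{N_2} \psi_i \psi_i^T$, respectively. $\cos^2\theta_i$ is the $i$th largest eigenvalue of $P_1^T P_2$ or $P_2^T P_1$. Alternatively, the canonical angles can be easily obtained by applying the SVD to the matrix of inner products between the orthonormal basis vectors of the two subspaces; its singular values are $\cos\theta_i$.
The geometric similarity between two subspaces $\mathcal{S}_1$ and $\mathcal{S}_2$ is defined by using the canonical angles as follows:
(2)  $\mathrm{sim}(\mathcal{S}_1, \mathcal{S}_2) = \frac{1}{N_1} \sum_{i=1}^{N_1} \cos^2\theta_i$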
In MSM, an input subspace is classified by comparison with class subspaces using this similarity as shown in Fig.3.
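The subspace similarity described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `subspace_basis` and `msm_similarity` are hypothetical, the subspace is taken from the left singular vectors of the uncentered feature matrix (PCA without centering), and the similarity is assumed to be the mean of the squared cosines of the canonical angles.

```python
import numpy as np

def subspace_basis(X, dim):
    """Orthonormal basis (d x dim) of the subspace for a feature matrix X (d x n).

    The left singular vectors of X are the eigenvectors of X X^T,
    i.e. PCA without data centering.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :dim]

def msm_similarity(U1, U2):
    """Mean of cos^2 of the canonical angles between span(U1) and span(U2).

    U1 and U2 must have orthonormal columns.
    """
    # The singular values of U1^T U2 are the cosines of the canonical angles.
    cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    n = min(U1.shape[1], U2.shape[1])
    return float(np.mean(cosines[:n] ** 2))
```

For example, two identical subspaces give a similarity of 1, and two 2-dimensional subspaces sharing exactly one direction (with the other directions orthogonal) give 0.5.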
2.2 Constrained MSM
The essence of the constrained MSM (CMSM) is the application of the MSM to a generalized difference subspace (GDS) (Fukui and Maki, 2015), as shown in Fig.4. GDS is designed to contain only difference components among subspaces. Thus, the projection of class subspaces onto GDS can increase the class separability among the class subspaces, substantially improving the classification ability of MSM (Fukui and Maki, 2015).
2.3 Convex cone model
In this subsection, we explain the definition of a convex cone and the projection of a vector onto a convex cone. A convex cone $\mathcal{C}$ is defined by a finite set of basis vectors $\{b_i\}_{i=1}^{r}$ as follows:
(3)  $\mathcal{C} = \{ a \mid a = \sum_{i=1}^{r} w_i b_i, \; w_i \ge 0 \}$
As indicated by this definition, the difference between a subspace and a convex cone is whether or not there are nonnegative constraints on the combination coefficients $w_i$.
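Given a set of feature vectors $\{f_i\}_{i=1}^{N}$, $f_i \in \mathbb{R}^d$, the basis vectors $\{b_i\}_{i=1}^{r}$ of a convex cone representing the distribution of $\{f_i\}$ can be obtained by nonnegative matrix factorization (NMF) (Lee and Seung, 1999; Kim and Park, 2008). Let $F = [f_1 \, f_2 \dots f_N] \in \mathbb{R}^{d \times N}$ and $B = [b_1 \, b_2 \dots b_r] \in \mathbb{R}^{d \times r}$. NMF generates the basis vectors $B$ by solving the following optimization problem:
(4)  $\arg\min_{B, W} \|F - BW\|_F$ s.t. $B \ge 0, \; W \ge 0$,
where $W \in \mathbb{R}^{r \times N}$ is the coefficient matrix and $\|\cdot\|_F$ denotes the Frobenius norm. We use the alternating nonnegativity-constrained least squares-based method (Kim and Park, 2008) to solve this problem.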
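The alternating scheme behind this factorization can be illustrated with a short sketch. It is not the active-set algorithm of Kim and Park (2008); it simply alternates exact nonnegative least squares solves using SciPy's `scipy.optimize.nnls`, one column or row at a time, which is slow but easy to follow. The function name `nmf_anls`, the random initialization, and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import nnls

def nmf_anls(A, r, n_iter=30, seed=0):
    """Factor a nonnegative matrix A (d x n) as B @ W with B, W >= 0.

    Returns B (d x r), W (r x n) and the Frobenius error after each sweep.
    """
    rng = np.random.default_rng(seed)
    d, n = A.shape
    B = rng.random((d, r))          # random nonnegative start (assumption)
    W = np.zeros((r, n))
    errors = []
    for _ in range(n_iter):
        # Fix B: one NNLS problem per column of A gives the coefficients W.
        for j in range(n):
            W[:, j], _ = nnls(B, A[:, j])
        # Fix W: one NNLS problem per row of A gives the basis matrix B.
        for i in range(d):
            B[i, :], _ = nnls(W.T, A[i, :])
        errors.append(np.linalg.norm(A - B @ W))
    return B, W, errors
```

Because each half-step solves its subproblem exactly, the reconstruction error cannot increase from one sweep to the next; the columns of `B` then serve as the basis vectors of the convex cone.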
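Although the basis vectors can be easily obtained by NMF, the projection of a vector onto the convex cone is slightly complicated by the nonnegative constraint on the coefficients. In Kobayashi and Otsu (2008), a vector $x$ is projected onto the convex cone with basis vectors $\{b_i\}_{i=1}^{r}$ by applying the nonnegative least squares method (Bro and De Jong, 1997) as follows:
(5)  $\arg\min_{\{w_i\}} \| x - \sum_{i=1}^{r} w_i b_i \|_2$ s.t. $w_i \ge 0$
The projected vector $\hat{x}$ is obtained as $\hat{x} = \sum_{i=1}^{r} w_i b_i$.
Finally, the angle $\theta$ between the convex cone and the vector $x$ can be calculated as follows:
(6)  $\theta = \arccos \left( \frac{\hat{x}^T x}{\|\hat{x}\|_2 \|x\|_2} \right)$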
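The projection and the angle above can be sketched as follows. SciPy's `scipy.optimize.nnls` is used here as a stand-in for the fast NNLS algorithm of Bro and De Jong (1997); the function names and the handling of a zero projection (treated as an angle of $\pi/2$) are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import nnls

def project_onto_cone(B, x):
    """Project x onto the convex cone spanned by the columns of B (d x r)."""
    w, _ = nnls(B, x)               # nonnegative coefficients w >= 0
    return B @ w

def angle_to_cone(B, x):
    """Angle (radians) between x and its projection onto the cone."""
    x_hat = project_onto_cone(B, x)
    denom = np.linalg.norm(x_hat) * np.linalg.norm(x)
    if denom == 0.0:
        # Zero projection: treat the vector as orthogonal to the cone (assumption).
        return np.pi / 2
    cos = np.clip(x_hat @ x / denom, -1.0, 1.0)
    return float(np.arccos(cos))
```

With the nonnegative quadrant of the plane as the cone (`B = np.eye(2)`), a vector inside the quadrant has angle 0, while `x = (1, -1)` projects to `(1, 0)` and has angle $\pi/4$.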
3 Proposed method
In this section, we explain the algorithms in the MCM and CMCM, after establishing the definition of geometric similarity between two convex cones.
3.1 Geometric similarity between two convex cones
We define the geometric similarity between two convex cones. To this end, we consider how to define multiple angles between two convex cones in the same way as canonical angles. Two convex cones $\mathcal{C}_1$ and $\mathcal{C}_2$ are formed by basis vectors $\{b_i^1\}_{i=1}^{N_1}$ and $\{b_i^2\}_{i=1}^{N_2}$, respectively. Assume that $N_1 \le N_2$ for convenience. The angles between two convex cones cannot be obtained analytically like the canonical angles between two subspaces, as it is necessary to consider the nonnegative constraint. Instead, we find two vectors, $p \in \mathcal{C}_1$ and $q \in \mathcal{C}_2$, which are closest to each other. Then, we define the angle between the two convex cones as the angle formed by the two vectors. In this way, we sequentially define multiple angles from the smallest to the largest.
First, we search for a pair of $d$-dimensional vectors $p_1 \in \mathcal{C}_1$ and $q_1 \in \mathcal{C}_2$ that have the maximum correlation, using the alternating least squares method (ALS) (Tenenhaus, 1988). The first angle $\theta_1$ is defined as the angle formed by $p_1$ and $q_1$. The pair of $p_1$ and $q_1$ can be found by using the following algorithm:
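Algorithm to search for the pair $p_1$ and $q_1$
Let $P_1(y)$ and $P_2(y)$ be the projections of a vector $y$ onto the two convex cones $\mathcal{C}_1$ and $\mathcal{C}_2$, respectively. For the details of the projection, see Section 2.3.

1) Randomly initialize $y \in \mathbb{R}^d$.

2) $p_1 = P_1(y) / \|P_1(y)\|_2$.

3) $q_1 = P_2(y) / \|P_2(y)\|_2$.

4) $\hat{y} = (p_1 + q_1) / 2$.

5) If $\|\hat{y} - y\|_2$ is sufficiently small, the procedure is completed. Otherwise, return to 2) setting $y = \hat{y}$.

6) $\cos^2\theta_1 = \left( \frac{p_1^T q_1}{\|p_1\|_2 \|q_1\|_2} \right)^2$.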
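For the second angle $\theta_2$, we search for a pair of vectors $p_2$ and $q_2$ with the maximum correlation, but with the minimum correlation with the first pair $p_1$ and $q_1$. Such a pair can be found by applying ALS to the two convex cones projected onto the orthogonal complement space of the subspace spanned by the vectors $p_1$ and $q_1$, as shown in Fig.5. Then $\theta_2$ is formed by $p_2$ and $q_2$. In this way, we can obtain all of the pairs of vectors $p_i$, $q_i$ forming the $i$th angle $\theta_i$, $i = 1, \dots, N_1$, where $N_1$ is the smaller of the numbers of basis vectors of the two cones.
With the resulting angles $\{\theta_i\}_{i=1}^{N_1}$, we define the geometrical similarity between two convex cones $\mathcal{C}_1$ and $\mathcal{C}_2$ as follows:
(7)  $\mathrm{sim}(\mathcal{C}_1, \mathcal{C}_2) = \frac{1}{N_1} \sum_{i=1}^{N_1} \cos^2\theta_i$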
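The sequential search for angles described in this subsection can be sketched as below. This is an illustrative reading of the procedure, not the authors' code: the names `unit_projection`, `closest_pair`, `deflate`, and `cone_similarity` are hypothetical, SciPy's `nnls` stands in for the projection of Section 2.3, the random start is drawn from nonnegative values, and the cosines are assumed to be averaged as squared values. The number of angles should not exceed the smaller number of basis vectors.

```python
import numpy as np
from scipy.optimize import nnls

def unit_projection(B, y):
    """Project y onto cone(B) by NNLS, then scale to unit length (zero stays zero)."""
    w, _ = nnls(B, y)
    p = B @ w
    norm = np.linalg.norm(p)
    return p / norm if norm > 1e-12 else p

def closest_pair(B1, B2, rng, n_iter=200, tol=1e-8):
    """Alternating search for unit vectors p in cone(B1) and q in cone(B2)."""
    y = rng.random(B1.shape[0])      # nonnegative random start (assumption)
    for _ in range(n_iter):
        p, q = unit_projection(B1, y), unit_projection(B2, y)
        y_new = 0.5 * (p + q)
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return p, q

def deflate(B, U):
    """Project the columns of B onto the orthogonal complement of span(U)."""
    B = B - U @ (U.T @ B)
    B[:, np.linalg.norm(B, axis=0) < 1e-10] = 0.0   # drop numerical noise
    return B

def cone_similarity(B1, B2, n_angles, seed=0):
    """Mean of cos^2 over n_angles sequentially defined angles between two cones."""
    rng = np.random.default_rng(seed)
    B1, B2 = np.array(B1, dtype=float), np.array(B2, dtype=float)
    cos2 = []
    for _ in range(n_angles):
        p, q = closest_pair(B1, B2, rng)
        cos2.append(float(p @ q) ** 2)
        # Orthonormal basis of span{p, q}; rank 1 when p and q coincide.
        U, s, _ = np.linalg.svd(np.stack([p, q], axis=1), full_matrices=False)
        U = U[:, s > 1e-10]
        B1, B2 = deflate(B1, U), deflate(B2, U)
    return float(np.mean(cos2))
```

Two identical cones yield a similarity of 1, and two cones lying in mutually orthogonal subspaces yield 0, which matches the behavior expected from the definition.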
3.2 Mutual convex cone method
The mutual convex cone method (MCM) classifies an input convex cone based on the similarities defined by Eq.(7) between the input and the class convex cones. MCM consists of two phases, a training phase and a recognition phase, as summarized in Fig.6.
Suppose that we are given $C$ class sets, each with $L$ images $\{x_i^c\}_{i=1}^{L}$, $c = 1, \dots, C$.
Training Phase

1) Feature vectors $\{f_i^c\}$ are extracted from the images $\{x_i^c\}$ of class $c$.

2) The basis vectors of the class $c$ convex cone, $\{b_i^c\}$, are generated by applying NMF to the set of feature vectors $\{f_i^c\}$.

3) $\{b_i^c\}$ are registered as the reference convex cone of class $c$.

4) The above process is conducted for all $C$ classes.
Recognition Phase

1) A set of images $\{x_i^{in}\}$ is input.

2) Feature vectors $\{f_i^{in}\}$ are extracted from the images $\{x_i^{in}\}$.

3) The basis vectors of the input convex cone, $\{b_i^{in}\}$, are generated by applying NMF to the input set of feature vectors.

4) The input image set is classified based on the similarity (Eq.(7)) between the input convex cone and the $c$th class reference convex cone, $c = 1, \dots, C$.
3.3 Generation of discriminant space
To enhance the performance of the mutual convex cone method, we introduce a discriminant space $\mathcal{D}$, which maximizes the between-class variance $\Sigma_b$ and minimizes the within-class variance $\Sigma_w$ for the convex cones projected onto $\mathcal{D}$, similarly to the Fisher discriminant analysis (FDA). In our method, the within-class variance is calculated from the basis vectors of the convex cones, and the between-class variance is calculated from the gaps among the convex cones, so as to effectively utilize the information formed by the convex cones.
We define these gaps as follows. Let $\mathcal{C}_c$ be the $c$th class convex cone with basis vectors $\{b_i^c\}$, $P_c$ be the projection operation of a vector onto $\mathcal{C}_c$ defined by Eq.(5), and $C$ be the number of classes. We consider $C$ vectors $p_1^c \in \mathcal{C}_c$, $c = 1, \dots, C$, such that the sum of their pairwise correlations is maximum. Such a set of vectors can be obtained by using the following algorithm. This algorithm is almost the same as the generalized canonical correlation analysis (Vía et al., 2005, 2007), except that the nonnegative least squares (LS) method is used instead of the standard LS method.
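Procedure to search for a set of first vectors $\{p_1^c\}_{c=1}^{C}$

1) Randomly initialize $y_1 \in \mathbb{R}^d$.

2) Project $y_1$ onto each convex cone using the projection operation $P_c$ onto the $c$th class convex cone, and then normalize the projection as $p_1^c = P_c(y_1) / \|P_c(y_1)\|_2$.

3) $\hat{y}_1 = \frac{1}{C} \sum_{c=1}^{C} p_1^c$, where $C$ is the number of classes.

4) If $\|\hat{y}_1 - y_1\|_2$ is sufficiently small, the procedure is completed. Otherwise, return to 2) setting $y_1 = \hat{y}_1$.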
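The procedure above can be sketched for an arbitrary number of cones. As before, this is a hedged illustration: the function name `closest_vectors` is hypothetical, SciPy's `nnls` stands in for the projection of Eq.(5), the start vector is drawn from nonnegative random values, and the update is assumed to be the mean of the normalized projections.

```python
import numpy as np
from scipy.optimize import nnls

def closest_vectors(bases, n_iter=200, tol=1e-8, seed=0):
    """Find one unit vector per cone so that the vectors are mutually close.

    bases: list of (d x r_c) arrays, one basis matrix per class cone.
    Returns P (d x C), whose c-th column lies in the c-th cone, and the
    final mean vector y.
    """
    rng = np.random.default_rng(seed)
    y = rng.random(bases[0].shape[0])    # nonnegative random start (assumption)
    for _ in range(n_iter):
        columns = []
        for B in bases:
            w, _ = nnls(B, y)            # nonnegative LS projection onto the cone
            p = B @ w
            columns.append(p / max(np.linalg.norm(p), 1e-12))
        P = np.stack(columns, axis=1)
        y_new = P.mean(axis=1)           # average of the normalized projections
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return P, y_new
```

When all cones coincide, the returned columns are identical; when the cones are single orthogonal rays, each column is simply the corresponding ray direction.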
Next, we search for a set of second vectors $\{p_2^c\}_{c=1}^{C}$ with the maximum sum of the correlations under the constraint that they have the minimum correlation with the previously found $\{p_1^c\}$. The second vectors can be obtained by applying the above procedure to the convex cones projected onto the orthogonal complement space of the vector $y_1$, where $y_1$ is the vector obtained when the first procedure converges. In the same way, a set of the $i$th vectors $\{p_i^c\}$ can be sequentially obtained by applying the same procedure to the convex cones projected onto the orthogonal complement space of $\{y_j\}_{j=1}^{i-1}$. In this way, we finally obtain the sets of $\{p_i^c\}$. With the sets of $\{p_i^c\}$, we define a difference vector $d_i^{c_1 c_2}$ between classes $c_1$ and $c_2$ as follows:
(8)  $d_i^{c_1 c_2} = p_i^{c_1} - p_i^{c_2}$
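Considering that each difference vector $d_i^{c_1 c_2} = p_i^{c_1} - p_i^{c_2}$ represents the gap between the two convex cones of classes $c_1$ and $c_2$, we define the between-class variance $\Sigma_b$ using these vectors as follows:
(9)  $\Sigma_b = \sum_{i=1}^{N_d} \sum_{c_1=1}^{C-1} \sum_{c_2=c_1+1}^{C} d_i^{c_1 c_2} (d_i^{c_1 c_2})^T$
where $C$ is the number of classes and $N_d \ge 1$ is the number of vector sets used.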
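Next, we define the within-class variance $\Sigma_w$ using the basis vectors $\{b_i^c\}_{i=1}^{N_c}$ for all $C$ classes of convex cones as follows:
(10)  $\Sigma_w = \sum_{c=1}^{C} \sum_{i=1}^{N_c} (b_i^c - \mu_c)(b_i^c - \mu_c)^T$
where $\mu_c = \frac{1}{N_c} \sum_{i=1}^{N_c} b_i^c$. Finally, the $N_D$-dimensional discriminant space $\mathcal{D}$ is spanned by the $N_D$ eigenvectors $\{\phi_i\}_{i=1}^{N_D}$ corresponding to the $N_D$ largest eigenvalues $\{\lambda_i\}_{i=1}^{N_D}$ of the following eigenvalue problem, in which $\Sigma_b$ is the between-class variance computed from the gaps among the convex cones:
(11)  $\Sigma_b \phi_i = \lambda_i \Sigma_w \phi_i$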
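The eigenvalue problem above is a symmetric generalized eigenproblem, which can be solved with `scipy.linalg.eigh(a, b)`. The sketch below assumes the between-class matrix is the sum of outer products of the difference vectors and the within-class matrix is the scatter of each class's basis vectors around their mean; the function name `discriminant_space` and the small ridge term `reg`, added so that the within-class matrix is positive definite, are assumptions of this illustration and are not taken from the text.

```python
import numpy as np
from scipy.linalg import eigh

def discriminant_space(diff_vectors, class_bases, dim, reg=1e-6):
    """Solve Sb v = lambda Sw v and return the top `dim` eigenvectors.

    diff_vectors: (d x m) array whose columns are difference (gap) vectors.
    class_bases:  list of (d x r_c) arrays, basis vectors of each class cone.
    reg:          ridge added to Sw to keep it positive definite (assumption).
    """
    d = diff_vectors.shape[0]
    Sb = diff_vectors @ diff_vectors.T          # sum of outer products of the gaps
    Sw = np.zeros((d, d))
    for B in class_bases:
        centered = B - B.mean(axis=1, keepdims=True)
        Sw += centered @ centered.T             # scatter around the class mean
    Sw += reg * np.eye(d)
    evals, evecs = eigh(Sb, Sw)                 # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:dim]
    return evecs[:, order], evals[order]
```

The returned eigenvectors are normalized with respect to the within-class matrix rather than to unit Euclidean length, so a downstream projection step may want to orthonormalize them first.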
3.4 Constrained mutual convex cone method
We construct the constrained MCM (CMCM) by incorporating the projection onto the discriminant space $\mathcal{D}$ into the MCM. CMCM consists of a training phase and a recognition phase, as shown in Fig.7. In the following, we explain each phase for the case in which $C$ classes have $L$ images each.
Training Phase

1) Feature vectors $\{f_i^c\}$ are extracted from the images $\{x_i^c\}$.

2) The basis vectors of the $c$th class convex cone, $\{b_i^c\}$, are generated by applying NMF to each class set of feature vectors.

3) Sets of difference vectors $\{d_i^{c_1 c_2}\}$ are generated according to the procedure described in Section 3.3.

4) The discriminant space $\mathcal{D}$ is generated by solving Eq.(11) using the difference vectors and the basis vectors.

5) The basis vectors $\{b_i^c\}$ are projected onto the discriminant space $\mathcal{D}$, and then the lengths of the projected basis vectors are normalized to 1. The set of these normalized vectors $\{\hat{b}_i^c\}$ forms the projected convex cone.

6) $\{\hat{b}_i^c\}$ are registered as the reference convex cone of class $c$.
Recognition Phase

1) A set of images $\{x_i^{in}\}$ is input.

2) Feature vectors $\{f_i^{in}\}$ are extracted from the images $\{x_i^{in}\}$.

3) The basis vectors of a convex cone, $\{b_i^{in}\}$, are generated by applying NMF to the set of feature vectors.

4) The basis vectors $\{b_i^{in}\}$ are projected onto the discriminant space $\mathcal{D}$, and then the lengths of the projected basis vectors are normalized to 1. The normalized projections are represented by $\{\hat{b}_i^{in}\}$.

5) The input set is classified based on the similarity (Eq.(7)) between the projected input convex cone and each projected class reference convex cone.
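The projection and classification steps of the two phases can be sketched as follows. The names `project_cone` and `classify` are hypothetical, and orthonormalizing the spanning vectors of the discriminant space by QR before projecting is an assumption of this sketch; the similarity between cones is passed in as a function so that any implementation of Eq.(7) can be plugged in.

```python
import numpy as np

def project_cone(B, D):
    """Project basis vectors (columns of B, d x r) onto the space spanned by
    the columns of D (d x k), then normalize each projected vector to length 1."""
    Q, _ = np.linalg.qr(D)                 # orthonormal basis of span(D) (assumption)
    Z = Q.T @ B                            # coordinates in the discriminant space
    norms = np.linalg.norm(Z, axis=0, keepdims=True)
    return Z / np.maximum(norms, 1e-12)

def classify(input_basis, reference_bases, similarity):
    """Return the index of the most similar reference cone and all scores."""
    scores = [similarity(input_basis, R) for R in reference_bases]
    return int(np.argmax(scores)), scores
```

In use, every class basis matrix and the input basis matrix would be passed through `project_cone` with the same discriminant space before `classify` is called.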
4 Evaluation experiments
In this section, we demonstrate the effectiveness of the proposed methods through four experiments. The first experiment uses the ETH80 dataset to verify the effectiveness of using multiple angles between convex cones as the similarity between them. The second experiment analyzes the nature of the difference vectors between two convex cones by visualizing them as images. The third experiment evaluates the classification performance of the proposed methods using three datasets, 1) ETH80 (Leibe and Schiele, 2003), 2) CMU Motion of Body (CMU MoBo) (Gross and Shi, 2001), and 3) YouTube Celebrities (YTC) (Kim et al., 2008), with a large number of training samples. The fourth experiment demonstrates the robustness of the proposed methods against the small sample size (SSS) problem, considering the situation in which only a few training samples are available for learning. In this experiment, we use the multi-view hand shape dataset (Ohkawa and Fukui, 2012).
4.1 Effectiveness of using multiple angles
In this experiment, we verify the effectiveness of using multiple angles for calculating the similarity between convex cones, through a classification experiment using the ETH80 dataset. The ETH80 dataset consists of object images in eight different categories, captured from 41 viewpoints. Each category has ten objects. One object randomly sampled from each category was used for training, and the remaining nine objects were used for testing. As an input image set, we used the 41 multi-view images of each object. We used images scaled to 32 × 32 pixels and converted to grayscale. Vectorized grayscale images were used as input features, i.e., the dimension of the feature vector is 1024.
We evaluated the classification performance of mutual convex cone method (MCM) and constrained MCM (CMCM), while varying the number of angles used for calculating the similarity. As baselines, the mutual subspace method (MSM) and constrained MSM (CMSM) were also evaluated. Dimensions of reference subspaces and convex cones were set to 20, and dimensions of input subspaces and convex cones were set to 10.
Fig.8 shows the accuracy changes of the different methods against the number of angles. The horizontal axis denotes the number of angles used for calculating the similarity. We can confirm that the accuracy of MCM and CMCM increases, as the number of angles increases. This result shows clearly the importance of comparing the whole structures of convex cones by using multiple angles rather than using only the minimum angle for accurate classification.
When only one or two angles are used, the accuracy of CMCM is lower than that of CMSM. However, as the number of angles increases, CMCM outperforms MSM and CMSM, which are based on subspace representation. This indicates that multiple angles are required to compare the structures of two convex cones.
4.2 Validity of difference vectors between convex cones
In this experiment, we demonstrate the validity of the difference vectors between convex cones by visualizing them for two sets of facial expressions, neutral and smile. The sets were extracted from the CMU PIE dataset (Gross et al., 2010). Each set has 20 frontal face images taken under various illumination conditions.
After representing the two sets of raw images as convex cones, we generated the difference vectors between the two convex cones according to Eq.(8). For comparison, we also calculated the difference vectors between the canonical vectors of two subspaces of the two sets. We set the number of basis vectors of each convex cone to 5 and the dimension of each subspace to 5.
Fig.9 shows the visualizations of the difference vectors between the convex cones and of those between the subspaces. We can see that both sets of difference vectors emphasize the regions around the smile lines and eyes. These regions move considerably compared with other regions when the expression changes from neutral to smile. However, the two sets differ somewhat in the resolution of the variation that they capture. To take a closer look at this difference, we calculated the mean images of the absolute values of each set of difference vectors, as shown in Fig.10. The difference vectors between the subspaces roughly capture differences over the whole face. On the other hand, the difference vectors between the convex cones clearly capture fine differences on the smile lines and around the eyes.
In addition, to verify how well a set of difference vectors between two convex cones captures the difference in their structure, we conducted a comparison experiment using two synthetic convex cones $\mathcal{C}_1$ and $\mathcal{C}_2$, which are shown in Fig.11. Each convex cone is spanned by three basis vectors, which were generated by applying NMF to a set of images of one of two different objects synthesized under 100 illumination conditions. We calculated the difference vectors between $\mathcal{C}_1$ and $\mathcal{C}_2$ and denote the convex cone spanned by them as $\mathcal{C}_d$. Note that the difference vectors are not orthogonal to each other, so that they span a convex cone. Besides $\mathcal{C}_d$, we generated a convex cone $\mathcal{C}_{nmf}$, which is spanned by three basis vectors obtained by applying NMF to a set of difference image vectors between pairs of object images of classes 1 and 2. According to our definition, we expect $\mathcal{C}_d$ to have a high correlation with $\mathcal{C}_{nmf}$. In fact, the first three cosine similarities between $\mathcal{C}_d$ and $\mathcal{C}_{nmf}$ are 0.9104, 0.8478, and 0.5426, respectively. These high correlations support the claim that the set of difference vectors, namely the convex cone spanned by them, effectively captures the structural difference between the convex cones.
4.3 Comparison of classification performance with conventional methods
In this subsection, we evaluate the classification performance of the proposed methods compared with various conventional methods using three public datasets. In the following, details of each dataset and experimental protocols are described. After that, experiment results are shown.
4.3.1 ETH80 dataset
The ETH80 dataset consists of eight different categories, captured from 41 viewpoints. Each category has ten objects. Five objects randomly sampled from each category were used for training, and the remaining objects were used for testing. As an input image set, we used the 41 multi-view images of each object. To keep the experiment consistent with previous works, we used images scaled to 32 × 32 pixels (Shah et al., 2017; Hayat et al., 2015). We evaluated the classification performance of each method in terms of the average accuracy of ten trials using randomly divided datasets.
For MSM and CMSM, the dimensions of class subspaces, input subspaces, and GDS were set to 50, 30, and 395, respectively. For MCM and CMCM, the numbers of the basis vectors of class and input convex cones were set to 50 and 30, respectively. The dimension of the discriminant space was set to 450. We determined these dimensionalities by cross-validation using the training data.
In this experiment, we used CNN features as feature vectors. To obtain CNN features under our experimental setting, we slightly modified the original ResNet-50 (He et al., 2016) trained on the ImageNet database (Russakovsky et al., 2015). First, we replaced the final 1000-way fully connected (FC) layer of the original ResNet-50 with a 1024-way FC layer and applied the ReLU function. Then, we added a $C$-way FC layer with softmax, where $C$ is the number of classes, behind the 1024-way FC layer. Moreover, to extract more effective CNN features from our modified ResNet, we fine-tuned it using the training set. A CNN feature vector was extracted from the 1024-way FC layer every time an image was input into our ResNet. As a result, the dimensionality of a CNN feature vector was 1024.
With our fine-tuned CNN, an input image set was classified based on the average, over the images in the set, of the confidence scores output for each class by the last FC layer with softmax. In this section, we refer to this method as "softmax".
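The "softmax" baseline can be sketched as follows, assuming the network's last FC layer provides one row of raw class scores (logits) per image; the function name `classify_by_mean_softmax` and the input layout are assumptions made for illustration.

```python
import numpy as np

def classify_by_mean_softmax(logits):
    """Classify an image set from per-image class scores.

    logits: (n_images x n_classes) array of raw outputs of the last FC layer.
    Returns the predicted class index and the set-level mean probabilities.
    """
    # Numerically stable softmax applied to each image separately.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    mean_probs = probs.mean(axis=0)        # average over the images in the set
    return int(np.argmax(mean_probs)), mean_probs
```

The class with the highest average probability over the whole image set is taken as the prediction.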
4.3.2 CMU MoBo dataset
The CMU MoBo dataset (Gross and Shi, 2001) consists of videos of 25 people walking on a treadmill. Although this dataset was originally intended for research on human gait analysis (Gross and Shi, 2001), in this experiment we conducted image set-based face classification following previous works (Shah et al., 2017; Hayat et al., 2015; Cevikalp and Triggs, 2010; Wang et al., 2008).
The face images were detected in the video frames by the Viola and Jones detection algorithm (Viola and Jones, 2004). Detected face images were resized to 40 × 40 pixels and converted to grayscale. The face images extracted from one video were considered as an image set.
The dataset contains four walking patterns (videos) of each person, except for one person. We used videos of 24 people with all walking patterns. One video randomly sampled from each person was used for training, and the remaining three videos were used for testing. We repeated the evaluation ten times with different random selections.
For MSM and CMSM, the dimensions of class subspaces, input subspaces, and GDS were set to 50, 50, and 1000, respectively. For MCM and CMCM, the numbers of the basis vectors of class and input convex cones were set to 50 and 30, respectively. The dimension of the discriminant space was set to 1000. We determined these dimensionalities by cross-validation using the training data. CNN features were extracted from the ResNet fine-tuned under this experimental setting, according to the same procedure used in the previous experiments.
Table 1: Experimental results (recognition rate (%) ± standard deviation) for the three public datasets.
Method                                  ETH80          CMU MoBo       YTC
DCC (Kim et al., 2007)                  91.75 ± 3.74   88.89 ± 2.45   51.42 ± 4.95
MMD (Wang et al., 2008)                 77.50 ± 5.00   92.50 ± 2.87   54.04 ± 3.69
CHISD (Cevikalp and Triggs, 2010)       79.53 ± 5.32   96.52 ± 1.18   60.42 ± 5.95
MMDML (Lu et al., 2015)                 94.5 ± 3.5     97.8 ± 1.0     -
ADNT (Hayat et al., 2015)               98.12 ± 1.69   97.92 ± 0.73   71.35 ± 4.83
PLRC (Feng et al., 2016)                87.72 ± 5.67   93.74 ± 4.3    61.28 ± 6.37
Reconstruct Model (Shah et al., 2017)   94.75 ± 4.32   98.33 ± 1.27   66.45 ± 5.07
softmax                                 96.50 ± 2.29   98.61 ± 1.52   64.18 ± 2.20
CNN feature + MSM                       99.50 ± 1.05   99.17 ± 0.97   64.26 ± 2.89
CNN feature + CMSM                      99.50 ± 1.05   99.58 ± 0.67   66.45 ± 2.36
CNN feature + MCM                       99.50 ± 1.05   98.75 ± 1.22   64.11 ± 2.68
CNN feature + CMCM                      99.75 ± 0.79   99.58 ± 0.67   66.74 ± 2.12
4.3.3 YTC dataset
The YTC dataset (Kim et al., 2008) contains 1910 videos of 47 people. Similarly to (Shah et al., 2017), as an image set, we used a set of face images extracted from a video by the Incremental Learning Tracker (Ross et al., 2008). All the extracted face images were scaled to 30 × 30 pixels and converted to grayscale. Three videos per person were randomly selected as training data, and six videos per person were randomly selected as test data. We conducted five-fold cross-validation according to the above procedure.
For MSM and CMSM, the dimensions of class subspaces, input subspaces, and GDS were set to 70, 10, and 824, respectively. For MCM and CMCM, the numbers of the basis vectors of class and input convex cones were set to 50 and 40, respectively. The dimension of the discriminant space was set to 1000. We determined these dimensionalities by cross-validation using the training data. CNN features were extracted from the ResNet fine-tuned under this experimental setting, according to the same procedure used in the previous experiments.
4.3.4 Results and discussion
Table 1 shows the classification results of the proposed methods and various conventional methods, including several methods based on deep neural networks. First of all, we can see that the subspace-based methods and the proposed MCM/CMCM achieve comparable or better performance than the conventional methods on all the datasets. In particular, it is notable that the proposed methods achieve results competitive with more complex methods using deep learning, such as softmax, MMDML, and ADNT. Especially on ETH80 and MoBo, they show very high recognition rates compared with these deep learning-based methods. The conventional methods do not explicitly consider the structural information of an image set. In contrast, the proposed methods effectively extract the detailed structural information through the convex cone representation. This difference in the classification mechanism leads to the advantage of our methods.
CMCM outperformed MCM in all the cases. This indicates that projecting onto the discriminant space can capture useful geometrical information to increase the class separability among the class convex cones, as we expected. CMSM also improves the performance of MSM. However, the improvement degree by CMCM is larger than that of CMSM. This implies that the discriminant space works better with convex cone representation to enhance the class separability among class cones.
The results on ETH80 and MoBo clearly show the effectiveness of both the cone-based and subspace-based methods compared with the conventional methods. However, it may be difficult to argue for the advantage of CMCM over CMSM on these datasets, since both achieved almost 100% recognition rates with near-zero equal error rates (EERs). These datasets seem to be relatively easy for both types of methods to classify.
On the other hand, YTC is difficult for all the methods, so that we can observe a difference between the recognition rates of the two. To visually confirm this advantage, we calculated the receiver operating characteristic (ROC) curves of the four subspace-based and cone-based methods, as shown in Fig.12. The ROC curves clearly indicate the strength of CMCM compared with CMSM. This superiority is also supported by the average area under the curve (AUC): the values for CMSM and CMCM are 0.9002 and 0.9341, respectively.
4.4 Robustness against limited training data
A major issue with deep neural networks is the requirement of a large number of training samples to learn the networks accurately. Therefore, the robustness against small sample size (SSS) is a necessary characteristic for effective methods using CNN features in practice. In this experiment, we evaluated the robustness of the different methods against SSS using our private multiview hand shape dataset (Ohkawa and Fukui, 2012).
4.4.1 Experimental protocol
The multiview hand shape dataset consists of 30 classes of hand shapes. Each class data was collected from 100 subjects at a speed of 1 fps for 4 s using a multicamera system equipped with seven synchronized cameras at intervals of 10 degrees. During data collection, the subjects were asked to rotate their hands at a constant speed to increase the number of viewpoints. Figure 13 shows several sample images in the dataset. The total number of images collected was 84000 (= 30 classes4 frames7 cameras 100 subjects).
We randomly divided the subjects into two sets. One set was used for training, and the other was used for testing. We evaluated the performances of the methods by setting the number of subjects used for training to 1, 2, 3, 4, 5, 10, and 15. In each case, the total number of training images was 30 classes × 7 cameras × 4 frames × the number of training subjects. We set the number of subjects used for testing to 50. As an input image set, we used the 28 (= 7 cameras × 4 frames) images of a subject. Thus, the total number of convex cones for testing was 1500 (= 30 classes × 50 subjects).
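The counts in this protocol follow directly from the stated factors. The short Python sketch below recomputes them; the constant and variable names (including `n` for the number of training subjects) are notation introduced only for this example.

```python
# Factors stated in the protocol.
CLASSES, CAMERAS, FRAMES, SUBJECTS = 30, 7, 4, 100
TEST_SUBJECTS = 50

# Total number of collected images.
total_images = CLASSES * FRAMES * CAMERAS * SUBJECTS

# Each input image set holds all views and frames of one subject for one class.
images_per_set = CAMERAS * FRAMES

# One convex cone is built per (class, test subject) pair.
test_cones = CLASSES * TEST_SUBJECTS

# Training images for each number of training subjects n used in the experiment.
train_images = {n: CLASSES * CAMERAS * FRAMES * n
                for n in (1, 2, 3, 4, 5, 10, 15)}

for n, count in train_images.items():
    print(f"{n:2d} training subjects: {count} training images")
```

With a single training subject, only 840 images are available to fine-tune the network, which is the small-sample regime this experiment targets.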
To extract CNN features from the images, we used a ResNet fine-tuned on the training images of each experimental condition.
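Table 2: Accuracy (%) versus the number of training subjects.

| Training subjects | softmax | MSM | CMSM | MCM | CMCM |
|---|---|---|---|---|---|
| 1 | 36.07 | 62.27 | 65.87 | 63.07 | 67.87 |
| 2 | 71.41 | 73.47 | 74.73 | 74.60 | 75.33 |
| 3 | 83.87 | 85.27 | 87.40 | 85.67 | 87.47 |
| 4 | 86.60 | 87.60 | 91.00 | 88.27 | 91.33 |
| 5 | 91.60 | 91.13 | 92.87 | 92.07 | 93.53 |
| 10 | 95.73 | 95.27 | 95.73 | 95.40 | 96.27 |
| 15 | 96.53 | 96.20 | 96.27 | 96.67 | 97.00 |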
4.4.2 Results and discussion
Table 2 shows the accuracies versus the number of training subjects. From the table, we can see that the overall performance of CMCM was better than that of the other methods. In particular, CMCM works well when the number of training subjects is small. For example, when the number of training subjects is 1, CMSM and CMCM achieve error rates of about half that of softmax. Moreover, CMCM outperforms the subspace-based methods, MSM and CMSM. This further indicates that the convex cone based method can represent the distribution of a set of CNN features more accurately than the subspace-based methods.
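The "about half" comparison can be checked directly from the accuracies in Table 2 by converting them to error rates. The Python sketch below does this; the helper names `error_rate` and `error_ratio` are introduced only for this example.

```python
# Accuracies (%) copied from Table 2, keyed by number of training subjects.
METHODS = ("softmax", "MSM", "CMSM", "MCM", "CMCM")
ACCURACY = {
    1: (36.07, 62.27, 65.87, 63.07, 67.87),
    2: (71.41, 73.47, 74.73, 74.60, 75.33),
    3: (83.87, 85.27, 87.40, 85.67, 87.47),
    4: (86.60, 87.60, 91.00, 88.27, 91.33),
    5: (91.60, 91.13, 92.87, 92.07, 93.53),
    10: (95.73, 95.27, 95.73, 95.40, 96.27),
    15: (96.53, 96.20, 96.27, 96.67, 97.00),
}


def error_rate(n_subjects, method):
    """Error rate (%) = 100 - accuracy (%)."""
    return 100.0 - ACCURACY[n_subjects][METHODS.index(method)]


def error_ratio(n_subjects, method, baseline="softmax"):
    """Error rate of `method` relative to that of `baseline`."""
    return error_rate(n_subjects, method) / error_rate(n_subjects, baseline)


# Method with the highest accuracy for each number of training subjects.
best = {n: METHODS[row.index(max(row))] for n, row in ACCURACY.items()}

print(round(error_ratio(1, "CMSM"), 3), round(error_ratio(1, "CMCM"), 3))
```

With one training subject, the error rates are 63.93% for softmax, 34.13% for CMSM, and 32.13% for CMCM, so both ratios are slightly above one half.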
5 Conclusion
In this paper, we proposed a method based on the convex cone model for image-set classification, referred to as the constrained mutual convex cone method (CMCM). We discussed the combination of the proposed method with CNN features, although our method can be applied to various types of features that satisfy a nonnegativity constraint.
The main contributions of this paper are 1) the introduction of a convex cone model to represent a set of feature vectors compactly and accurately; 2) the definition of the geometrical similarity of two convex cones based on the angles between them, which are obtained by the alternating least squares method; 3) the proposal of a method, i.e., MCM, for classifying convex cones using the angles as the similarity index; 4) the introduction of a discriminant space that maximizes the between-class variance (gaps) between convex cones and minimizes the within-class variance; and 5) the proposal of the constrained MCM (CMCM), which incorporates the above projection into the MCM.
Through two experiments, we verified the effectiveness of multiple angles and of the discriminant space, which are the essence of the proposed framework. We then evaluated the classification performance of the proposed methods by comparing them with various types of conventional methods. The proposed methods achieved competitive results regardless of whether the number of training samples was large or small.
Acknowledgements.
Part of this work was supported by JSPS KAKENHI Grant Number JP16H02842.
References
 Afriat (1957) Afriat SN (1957) Orthogonal and oblique projectors and the characteristics of pairs of vector spaces. In: Mathematical Proceedings of the Cambridge Philosophical Society, vol 53, pp 800–816
 Azizpour et al. (2016) Azizpour H, Razavian AS, Sullivan J, Maki A, Carlsson S (2016) Factors of transferability for a generic ConvNet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(9):1790–1802
 Belhumeur and Kriegman (1998) Belhumeur PN, Kriegman DJ (1998) What is the set of images of an object under all possible illumination conditions? International Journal of Computer Vision 28(3):245–260
 Bro and De Jong (1997) Bro R, De Jong S (1997) A fast non-negativity-constrained least squares algorithm. Journal of Chemometrics 11(5):393–401
 Cevikalp and Triggs (2010) Cevikalp H, Triggs B (2010) Face recognition based on image sets. In: Computer Vision and Pattern Recognition, IEEE, pp 2567–2573
 Chen et al. (2016) Chen JC, Patel VM, Chellappa R (2016) Unconstrained face verification using deep CNN features. In: 2016 IEEE Winter Conference on Applications of Computer Vision, pp 1–9
 Feng et al. (2016) Feng Q, Zhou Y, Lan R (2016) Pairwise linear regression classification for image set retrieval. In: Computer Vision and Pattern Recognition, pp 4865–4872
 Fisher (1936) Fisher RA (1936) The use of multiple measurements in taxonomic problems. Annals of Human Genetics 7(2):179–188
 Fukui and Maki (2015) Fukui K, Maki A (2015) Difference subspace and its generalization for subspace-based methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(11):2164–2177
 Fukui and Yamaguchi (2005) Fukui K, Yamaguchi O (2005) Face recognition using multi-viewpoint patterns for robot vision. In: The Eleventh International Symposium of Robotics Research, pp 192–201
 Fukui and Yamaguchi (2007) Fukui K, Yamaguchi O (2007) The kernel orthogonal mutual subspace method and its application to 3D object recognition. In: Asian Conference on Computer Vision, pp 467–476
 Fukui et al. (2006) Fukui K, Stenger B, Yamaguchi O (2006) A framework for 3D object recognition using the kernel constrained mutual subspace method. In: Asian Conference on Computer Vision, pp 315–324
 Georghiades et al. (2001) Georghiades AS, Belhumeur PN, Kriegman DJ (2001) From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6):643–660
 Gross and Shi (2001) Gross R, Shi J (2001) The CMU motion of body (MoBo) database. Tech. Rep. CMU-RI-TR-01-18, Carnegie Mellon University, Pittsburgh, PA
 Gross et al. (2010) Gross R, Matthews I, Cohn J, Kanade T, Baker S (2010) Multi-PIE. Image and Vision Computing 28(5):807–813
 Guanbin Li and Yu (2015) Li G, Yu Y (2015) Visual saliency based on multiscale deep features. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp 5455–5463
 Hayat et al. (2015) Hayat M, Bennamoun M, An S (2015) Deep reconstruction models for image set classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(4):713–727
 He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition
 Hotelling (1936) Hotelling H (1936) Relations between two sets of variates. Biometrika 28(3/4):321–377
 Kim and Park (2008) Kim H, Park H (2008) Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications 30(2):713–730
 Kim et al. (2008) Kim M, Kumar S, Pavlovic V, Rowley H (2008) Face tracking and recognition with visual constraints in real-world videos. In: Computer Vision and Pattern Recognition, IEEE, pp 1–8
 Kim et al. (2007) Kim TK, Kittler J, Cipolla R (2007) Discriminative learning and recognition of image set classes using canonical correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(6):1005–1018
 Kobayashi and Otsu (2008) Kobayashi T, Otsu N (2008) Cone-restricted subspace methods. In: International Conference on Pattern Recognition, pp 1–4
 Kobayashi et al. (2010) Kobayashi T, Yoshikawa F, Otsu N (2010) Cone-restricted kernel subspace methods. In: IEEE International Conference on Image Processing, pp 3853–3856
 Lee and Seung (1999) Lee DD, Seung HS (1999) Learning the parts of objects by nonnegative matrix factorization. Nature 401(6755):788
 Lee et al. (2005) Lee KC, Ho J, Kriegman DJ (2005) Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5):684–698
 Leibe and Schiele (2003) Leibe B, Schiele B (2003) Analyzing appearance and contour based methods for object categorization. In: IEEE Conference on Computer Vision and Pattern Recognition, vol 2, pp 409–415
 Lu et al. (2015) Lu J, Wang G, Deng W, Moulin P, Zhou J (2015) Multi-manifold deep metric learning for image set classification. In: Computer Vision and Pattern Recognition, pp 1137–1145
 Lu et al. (2017) Lu J, Wang G, Zhou J (2017) Simultaneous feature and dictionary learning for image set based face recognition. IEEE Transactions on Image Processing 26(8):4042–4054
 Nair and Hinton (2010) Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning, pp 807–814
 Ohkawa and Fukui (2012) Ohkawa Y, Fukui K (2012) Hand-shape recognition using the distributions of multi-viewpoint image sets. IEICE Transactions on Information and Systems 95(6):1619–1627
 Otsu (1979) Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1):62–66, DOI 10.1109/TSMC.1979.4310076
 Ross et al. (2008) Ross DA, Lim J, Lin RS, Yang MH (2008) Incremental learning for robust visual tracking. International Journal of Computer Vision 77(1–3):125–141
 Russakovsky et al. (2015) Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
 Sakano and Mukawa (2000) Sakano H, Mukawa N (2000) Kernel mutual subspace method for robust facial image recognition. In: International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, vol 1, pp 245–248
 Shah et al. (2017) Shah SAA, Nadeem U, Bennamoun M, Sohel FA, Togneri R (2017) Efficient image set classification using linear regression based image reconstruction. In: Computer Vision and Pattern Recognition Workshops, pp 601–610
 Sharif Razavian et al. (2014) Sharif Razavian A, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: an astounding baseline for recognition. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 806–813
 Sogi et al. (2018) Sogi N, Nakayama T, Fukui K (2018) A method based on convex cone model for imageset classification with CNN features. In: International Joint Conference on Neural Networks (IJCNN), pp 1–8
 Tenenhaus (1988) Tenenhaus M (1988) Canonical analysis of two convex polyhedral cones and applications. Psychometrika 53(4):503–524
 Vía et al. (2005) Vía J, Santamaría I, Pérez J (2005) Canonical correlation analysis (CCA) algorithms for multiple data sets: Application to blind SIMO equalization. In: 13th European Signal Processing Conference, pp 1–4
 Vía et al. (2007) Vía J, Santamaría I, Pérez J (2007) A learning algorithm for adaptive canonical correlation analysis of several data sets. Neural Networks 20(1):139–152
 Viola and Jones (2004) Viola P, Jones MJ (2004) Robust real-time face detection. International Journal of Computer Vision 57(2):137–154
 Wang et al. (2008) Wang R, Shan S, Chen X, Gao W (2008) Manifold-manifold distance with application to face recognition based on image set. In: Computer Vision and Pattern Recognition, IEEE, pp 1–8
 Wang et al. (2017) Wang Z, Zhu R, Fukui K, Xue JH (2017) Matched shrunken cone detector (MSCD): Bayesian derivations and case studies for hyperspectral target detection. IEEE Transactions on Image Processing 26(11):5447–5461
 Wang et al. (2018) Wang Z, Zhu R, Fukui K, Xue JH (2018) Cone-based joint sparse modelling for hyperspectral image classification. Signal Processing 144:417–429
 Yamaguchi et al. (1998) Yamaguchi O, Fukui K, Maeda K (1998) Face recognition using temporal image sequence. In: Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, pp 318–323