Monte Carlo human identification refinement using joints uncertainty

In this work, we propose a new method to re-identify the same individual among different people using RGB-D data. Each human signature is a combination of soft biometric traits. In particular, we extract a color-based descriptor and a local feature descriptor through a Monte Carlo-based algorithm that takes into account the uncertainty of human joints and, applied to each descriptor, refines the similarity match against a spatiotemporal database that is updated over time. We analyzed the effects of Monte Carlo refinement in terms of the final maximum matching score obtained for the two descriptors. In addition, we tested the performance of the proposed method on a widely


INTRODUCTION
The automatic re-identification (RE-ID) of people, i.e., recognizing the same person in environments populated by multiple people, remains an open challenge. It requires robustly re-identifying a person despite occlusions, lighting variability, different backgrounds, and changes in pose. Most works in this field involve mass surveillance applications [1]. The widespread use of surveillance cameras in public places such as streets, malls, airports, or large-scale events makes RE-ID of people valuable mainly for raising public safety standards [2]. Other applications range from long-term pedestrian tracking [3] to human activity recognition [4]. In each of them, automatic classification systems can identify and differentiate people across many prerecorded or real-time videos with little human effort and time.
At the same time, robust identification of the selected person is essential in the industrial environment, where autonomous mobile robots operate in spaces populated by humans and interact with them in several ways [5], [6]. The goal is to preserve the safety of the operators while supporting their work: proper identification (ID) of operators plays a key role in this context.
In this study, we explored and selected new descriptors and methods to address the task of RE-ID. In each frame, we combined color and 3D data acquired by an RGB-D sensor to extract color, biomechanics, and depth information. We designed a new method to select and track a person with high accuracy, suitable for industrial applications, that considers the uncertainty of human joints. To demonstrate the performance of the proposed method, we first evaluated the effects of the Monte Carlo refinement and then the whole method in terms of recognition rate on the public BIWI RGBD-ID dataset. Furthermore, we compared the obtained results with a public RE-ID neural network provided by the OpenVINO toolkit [7] on the same dataset.
The rest of the paper is organized as follows. Section 2 outlines the state of the art in RE-ID techniques. In Section 3, we describe the method developed. In Section 4, we discuss the effects of Monte Carlo refinement and validation. In the final section, we draw conclusions.

RELATED WORK
Most of the techniques used in the literature for RE-ID exploit soft biometric traits to distinguish humans from their peers. Soft biometric traits [8] are physical (e.g., skin color, eye color, hair color, height, weight, gender, race, etc.), behavioral (e.g., gait, keystroke, signature, etc.), or adhered human characteristics (e.g., clothes color, tattoos, accessories, etc.). Their advantages include compliance with natural human description labels and nonobtrusiveness in data acquisition. In general, colour is the most common factor affecting RE-ID performance and is often encoded in histograms [9], [10]. Its inconsistency is most prominent under viewpoint and pose changes [11]. For these reasons, we also used a color-based descriptor in our method. Other approaches use matching based on local feature detectors and descriptors, such as Speeded-Up Robust Features (SURF) [12]. In [13], [14], the SURF points of interest also cover portions of the background, and their number and position are strongly influenced by the person's pose, thus making the final RE-ID less accurate. In our work, we used the same SURF descriptor but on points of interest at known positions to avoid this problem.
There are few studies in which descriptors for RE-ID are based only on 3D information [15]-[18]. In these works, descriptors based only on 3D data have the advantage of remaining usable when the person changes clothes or under significant illumination changes. However, the limited and inaccurate information available restricts the final RE-ID performance. In [19], depth features, such as normalized measurements of body parts, are calculated from the positions of joints. Their solution is limited by the joint extraction method, which is not discussed in their article, and, more generally, by the depth resolution.
To overcome these limitations, many works in the literature propose descriptors that integrate colour and 3D data to improve RE-ID accuracy. In particular, [20] combines clothing appearance descriptors with anthropometric measures extracted from depth data. Anthropometric measurements, such as limb length, are strongly influenced by the method of joint extraction and the person's pose, so qualitative feedback on their estimation is insufficient to obtain a robust descriptor. Also, in [21], the descriptors do not consider the hand or ankle joints because, in contrast to our work, there is no information about their uncertainty. Another example of the combined use of RGB-D data is [22]. The biometric data involved there are colour histograms, split between the upper and lower body, and the subjects' height. In this case, the histograms include the background, which reduces the final accuracy of RE-ID. In the case of occlusions, their method loses the height information, which is the only parameter they extract from the 3D information. More recent work [23] uses a single color-based descriptor generated from a partition grid, the size of which depends on the skeletal posture obtained using 3D information. The points in each person's cloud are then grouped according to their position in the grid. This approach requires an updated database for each posture, and managing these coloured point clouds has a higher computational cost than our approach, in which masks built from the joint positions are used to compute a colour-based descriptor for each anatomical part. In addition, because of the imperfect point cloud, some colour information may be lost, especially at the edges of human anatomical parts. Finally, using a single colour-based descriptor makes their RE-ID result less accurate.
All the previous works use direct strategies to extract features. The recent spread of deep learning has promoted the use of machine learning techniques to learn similarity models from training samples [24], [25]. However, learning-based strategies are time-consuming for training and testing steps and often unpredictable, which is why they are often used in industry only as support.

METHODOLOGY
The main steps of the proposed method are shown in Figure 1. Using a neural network, we first extrapolate the uncertainty ellipses related to the position of human joints. Then we select two descriptors, one based on colour, and one based on local feature, for pseudo-random points within the uncertainty ellipses of each joint. Afterwards, we compare the selected descriptors for each joint position to those stored in a spatiotemporal database used as ground truth. A sequential Monte Carlo method refines the joint positions for each descriptor to obtain the best matching score. Finally, the highest descriptor scores are concatenated to get the total RE-ID score.
We performed RE-ID only when at least 10 of the 18 joints are found and when the bounding box of the selected person overlaps the bounding boxes of the other people by less than 30 %. In addition, since the RE-ID of people is also short-term, we assumed that the people to be re-identified do not suddenly change clothes or accessories.
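The gating condition above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the overlap is assumed to be the intersection area divided by the area of the selected person's bounding box, since the text does not define how the overlapping percentage is measured.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Joint = Optional[Tuple[float, float]]    # None when the joint was not detected


def overlap_fraction(target: Box, other: Box) -> float:
    """Fraction of the target box area covered by the other box (assumed definition)."""
    ix = max(0.0, min(target[2], other[2]) - max(target[0], other[0]))
    iy = max(0.0, min(target[3], other[3]) - max(target[1], other[1]))
    area = (target[2] - target[0]) * (target[3] - target[1])
    return (ix * iy) / area if area > 0 else 0.0


def can_reidentify(joints: Sequence[Joint], target_box: Box,
                   other_boxes: Sequence[Box],
                   min_joints: int = 10, max_overlap: float = 0.30) -> bool:
    """Return True when enough joints are found and no other person overlaps too much."""
    found = sum(j is not None for j in joints)
    if found < min_joints:
        return False
    return all(overlap_fraction(target_box, b) < max_overlap for b in other_boxes)
```
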

Human pose estimation with joints uncertainty
To estimate the human pose, i.e., the joints and the connections between them for each person within an image, we applied a neural network provided by Intel in the OpenVINO toolkit, called human-pose-estimation-0001. The OpenVINO toolkit is a collection of libraries for computer vision. It enables CNN-based deep learning inference and contains a version of the OpenCV [26] libraries optimized for Intel hardware. This network is based on the OpenPose [27] approach with a tuned MobileNet v1 [28] as a feature extractor. The network is also optimized to run on the Inference Engine, a high-performance engine for neural networks developed by Intel, which allows inference on a range of Intel hardware such as Intel CPUs, Processor Graphics, and FPGAs.
This network performs bottom-up detection: it does not first detect whole human figures in the frame but looks for individual points of interest and then associates them into parts/joints to obtain the pose of each person. With this network it is possible to obtain up to 18 joints per person: ears, eyes, nose, neck, shoulders, elbows, wrists, hips, knees, and ankles. The network is validated on the COCO dataset [29].
An RGB image is given as input to the network, which produces two different outputs. The first consists of 18 probability maps, one for each joint, called heatmaps; the second consists of 19 × 2 layers, one for the horizontal and one for the vertical direction, called Part Affinity Fields (PAFs). With the first set of maps, it is possible to get all the joints in the image by locating the peaks. The second set, on the other hand, provides information on how to match the joints that belong to a single person. In particular, by taking two candidate points and the segment between them, it is possible to determine whether the pair belongs to the same person by checking whether the orientation and position of the segment in the frame match the orientation of the PAF unit vector. This is done by evaluating the scalar product of the segment orientation and the PAF value at a discrete number of points along the segment. If the deviation between the segment orientation and the PAF at these points is below a threshold, the joint pair is saved. Finally, the network returns up to 18 joints for each person, assembling each skeleton by chaining valid joint pairs that share a joint with another pair.
We used the heatmap of each joint to extrapolate the ellipse of uncertainty related to the position of the joint, and we modified the way the joints are extracted from the heatmaps. Associating an ellipse of uncertainty with the joint positions is all the more important given how the OpenVINO neural network extracts them: it does not consider the joints' positions in previous frames, so their positions change independently between frames.
To extract the ellipse of uncertainty of each joint, we first resize the probability map, since the output map of the network is smaller than the input image. Figure 2 shows an example of the non-symmetric probability distribution of a joint. Then we cut the probability map at a threshold, set to 50 % probability, and take the resulting contour area to select points for the descriptors. The threshold percentage affects the Cheesman constant [30], which changes the dimensions of the ellipse axes and, accordingly, the final size of the ellipses. Assuming a percentage higher than 50 % would result in larger ellipses, which for our purpose would increase the risk of including the background instead of finding the correct position of the joints. This percentage may change depending on the resolution of the image input to the neural network.
Before proceeding, we merged the information from the 2D RGB image with the point cloud provided by the depth camera (Figure 3). This allowed us to better estimate the position in the third dimension of each point within the selected area and to remove, through median filtering in the depth dimension, the points that are not on the surface of the human body. With the remaining points, we computed an average weighted by the heatmap values to find the new position of each joint, i.e., its centre of mass, around which we approximate a second-order surface. We calculated the Hessian matrix discretely by taking eight points around the new mean joint position. Each term of the Hessian matrix can be approximated by taking the difference between two instances of the gradient vector evaluated at nearby points of a generic joint i at row ri and column ci.
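The PAF check described above can be illustrated with a short sketch. It is an assumption-laden example rather than the network's actual post-processing: the function name, the number of sample points, and the nearest-pixel sampling are hypothetical. It returns the mean scalar product between the segment direction and the PAF along the segment; a pair would be kept when this alignment is high enough, i.e., when the orientation mismatch is small.

```python
import numpy as np


def paf_pair_score(paf_x, paf_y, p_a, p_b, n_samples=10):
    """Mean dot product between the unit vector a->b and the PAF sampled along a-b.

    paf_x, paf_y: 2D arrays with the horizontal and vertical PAF components.
    p_a, p_b: candidate joint positions as (x, y) pixel coordinates.
    """
    a = np.asarray(p_a, dtype=float)
    b = np.asarray(p_b, dtype=float)
    d = b - a
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return 0.0
    u = d / norm  # unit vector giving the segment orientation
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = a + ts[:, None] * d  # discrete points along the segment
    cols = np.clip(np.rint(pts[:, 0]).astype(int), 0, paf_x.shape[1] - 1)
    rows = np.clip(np.rint(pts[:, 1]).astype(int), 0, paf_x.shape[0] - 1)
    dots = paf_x[rows, cols] * u[0] + paf_y[rows, cols] * u[1]
    return float(dots.mean())
```

A score close to 1 means that the field points along the segment everywhere; the acceptance threshold is left to the caller because its value is not given in the text.
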
The Hessian matrix terms are replaced by finite differences, where n is the difference parameter in pixels. If n is too small, there is a risk of amplifying noise, while if n is too large, the result may be meaningless because it is averaged over too wide an area. For our application, we chose n = 4 pixels, in accordance with the 9 × 9 box filters used in the SURF descriptor to approximate Gaussian second-order derivatives [12].
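For reference, a standard central-difference approximation consistent with the eight neighbouring points described above can be written as follows. This is a textbook form given under the assumption that P denotes the (resized) probability map of joint i; it is not necessarily the exact expression used by the authors.

```latex
\[
\frac{\partial^2 P}{\partial r^2}\bigg|_{(r_i,c_i)} \approx
\frac{P(r_i+n,\,c_i) - 2P(r_i,\,c_i) + P(r_i-n,\,c_i)}{n^2},
\qquad
\frac{\partial^2 P}{\partial c^2}\bigg|_{(r_i,c_i)} \approx
\frac{P(r_i,\,c_i+n) - 2P(r_i,\,c_i) + P(r_i,\,c_i-n)}{n^2},
\]
\[
\frac{\partial^2 P}{\partial r\,\partial c}\bigg|_{(r_i,c_i)} \approx
\frac{P(r_i+n,\,c_i+n) - P(r_i+n,\,c_i-n) - P(r_i-n,\,c_i+n) + P(r_i-n,\,c_i-n)}{4n^2}.
\]
```

The four axis-aligned neighbours and the four diagonal neighbours at distance n together give the eight points around the mean joint position.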
Finally, the uncertainty ellipse was evaluated with the covariance matrix to obtain a statistical description of the joint positions.
The covariance matrix corresponds to the negative of the inverse of the Hessian matrix [31]. Following this approach, we generate an object for each joint with three elements: pixel point, covariance matrix, and probability ellipse (Figure 4).
Figure 5 compares the joint positions calculated through their centres of mass, as explained above, with those given by the peaks of the heatmaps, as output by the OpenVINO network. It shows that, by extracting the position of each joint from the statistical distribution of its heatmap and applying a depth filter to the selected points around it, the key points lie closer to the centre of the human joints.
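A minimal sketch of this step is given below, under two stated assumptions: the Hessian is taken on the logarithm of the heatmap (the usual Laplace-approximation reading of "covariance equals the negative inverse Hessian"), and the second derivatives use standard central differences with step n. The function names are hypothetical and the depth filtering is omitted.

```python
import numpy as np


def weighted_centroid(heatmap, mask):
    """Centre of mass (row, col) of the masked pixels, weighted by the heatmap."""
    rows, cols = np.nonzero(mask)
    w = heatmap[rows, cols]
    return float((rows * w).sum() / w.sum()), float((cols * w).sum() / w.sum())


def hessian_fd(f, r, c, n=4):
    """Central-difference Hessian of f at (r, c) using eight neighbours at step n."""
    f_rr = (f[r + n, c] - 2.0 * f[r, c] + f[r - n, c]) / n**2
    f_cc = (f[r, c + n] - 2.0 * f[r, c] + f[r, c - n]) / n**2
    f_rc = (f[r + n, c + n] - f[r + n, c - n]
            - f[r - n, c + n] + f[r - n, c - n]) / (4.0 * n**2)
    return np.array([[f_rr, f_rc], [f_rc, f_cc]])


def joint_covariance(heatmap, r, c, n=4, eps=1e-12):
    """Covariance as the negative inverse Hessian of the log-heatmap (assumption)."""
    log_p = np.log(heatmap + eps)
    return -np.linalg.inv(hessian_fd(log_p, r, c, n))


def ellipse_axes(cov, k=1.0):
    """Semi-axis lengths (ascending) and directions of the k-sigma ellipse."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    return k * np.sqrt(eigvals), eigvecs
```

For a heatmap that is exactly Gaussian, the log-heatmap is quadratic and the central differences recover the covariance without discretization error, which makes the sketch easy to check on synthetic data.
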

Descriptors matching
After human pose estimation, the selected user must be found in the frame. From the result of the joint detections, for each descriptor, we select a fixed number of pseudo-random points that follow a Gaussian probability distribution within the ellipse of uncertainty of each joint. To select these points, we first rotated each uncertainty ellipse using the respective eigenvector matrix and translated it to the origin using the positions of the centres of mass of the joints. Since the covariance matrices are symmetric, their eigenvectors are mutually orthogonal, so the pseudo-random points can be drawn independently along the two semi-axes of each ellipse. Once selected, the points were reprojected into the original reference system.
After an initial analysis, we chose a colour-based histogram (HIST) descriptor and the SURF descriptor for our human signature. Each of the two descriptors, when compared with those stored in the database, provides an independent result in terms of the joint positions with the best match. This is because the histogram-based descriptor requires approximating anatomical parts with geometric figures, a simplification that does not correspond to reality.
In particular, the HIST descriptor is evaluated within the selected masks by analysing the Hue (H) channel of the HSV colour space. The mask was obtained for the face and torso by connecting the joints on their edges. In contrast, for the upper and lower limbs, starting from the direction normal to each pair of joints, quadrilateral masks were found by adding additional points at a fixed length from the selected segment. This length is reduced or increased according to the user's distance from the camera. In the example of Figure 6, the width of the limb masks is set to 8 pixels. Before using the H-channel histogram as a descriptor, we analysed the most frequently chosen colour spaces, L*A*B* and HSV [32]. The H channel of the HSV colour space is the primary colour attribute and represents the phase angle of the colour, which corresponds to the usual name of the colour. Compared to L*A*B*, HSV encapsulates colour information in a way more similar to how the human eye perceives colour. Instead of defining a colour in terms of a combination of colour components, the HSV colour model describes a colour with only one channel. In addition, H values in HSV are more robust to external light changes [33]. Our tests, in which we compared the H channel of HSV with the L*A* channels of L*A*B*, also showed that the H channel behaves more robustly with respect to light changes and colour description for the processed images. To express the similarity between the histogram H1 obtained for each anatomical part and the one selected from the database, H2, we used the OpenCV correlation metric SH(H1,H2):
\[
S_H(H_1,H_2) = \frac{\sum_{I}\left(H_1(I)-\bar{H}_1\right)\left(H_2(I)-\bar{H}_2\right)}{\sqrt{\sum_{I}\left(H_1(I)-\bar{H}_1\right)^2 \sum_{I}\left(H_2(I)-\bar{H}_2\right)^2}}
\]
where $\bar{H}_k = \frac{1}{n_b}\sum_{J} H_k(J)$ and nb is the total number of histogram bins. For our test, we set nb equal to 40 bins.
Only for the masks of the lower and upper limbs did we consider pseudo-random points around the selected joints, because these limbs are the most prone to errors in joint placement, especially in the direction normal to the limb, where there is a greater possibility of including parts of the background. Each limb is divided into two anatomical regions, for a total of eight anatomical parts. The pseudo-random points for each joint at the end of each anatomical region are projected orthogonally only in the direction normal to the vector linking the two joints.
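The point selection can be sketched as below. This is an illustrative version with hypothetical names: points are drawn as independent standard normals along the two eigen-directions, rejected when they fall outside the k-sigma ellipse, and then rotated and translated back to image coordinates. The scale k of the ellipse is an assumption, since it depends on the threshold constant mentioned earlier.

```python
import numpy as np


def sample_in_ellipse(mean, cov, n_points, k=1.0, rng=None):
    """Draw Gaussian pseudo-random points restricted to the k-sigma uncertainty ellipse."""
    rng = np.random.default_rng() if rng is None else rng
    mean = np.asarray(mean, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix: orthogonal eigenvectors
    sigmas = np.sqrt(eigvals)
    samples = []
    while len(samples) < n_points:
        z = rng.standard_normal(2)          # independent along the two semi-axes
        if float(np.sum(z**2)) <= k**2:     # keep only points inside the ellipse
            samples.append(mean + eigvecs @ (sigmas * z))  # back to image frame
    return np.array(samples)
```

Rejection sampling keeps the Gaussian shape inside the ellipse; with k = 1 roughly 39 % of the draws are accepted, so the loop terminates quickly for the small point counts used here.
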
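As an illustration of the histogram comparison, the sketch below computes a 40-bin hue histogram and the correlation between two histograms with plain NumPy instead of calling OpenCV. The function names are hypothetical, the hue range of 0 to 180 follows the 8-bit OpenCV convention and is an assumption about the data, and returning 0.0 for a flat histogram is a simple convention chosen here.

```python
import numpy as np


def hue_histogram(hue_values, n_bins=40, hue_max=180.0):
    """Histogram of the hue values that fall inside a mask (values already extracted)."""
    hist, _ = np.histogram(hue_values, bins=n_bins, range=(0.0, hue_max))
    return hist.astype(float)


def hist_correlation(h1, h2):
    """Correlation between two histograms with the same number of bins."""
    d1 = h1 - h1.mean()
    d2 = h2 - h2.mean()
    denom = np.sqrt((d1**2).sum() * (d2**2).sum())
    # Flat histograms have zero variance; 0.0 is returned by convention (assumption).
    return float((d1 * d2).sum() / denom) if denom > 0 else 0.0
```

Identical histograms give a score of 1, and the score is insensitive to a uniform scaling of the bin counts, which helps when the masks of the frame and of the database cover different numbers of pixels.
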
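The limb geometry described above can be sketched as follows. The names are hypothetical, and the `offset` parameter (the distance added on each side of the segment) is an assumption: the text gives 8 pixels as an example value but does not state whether it is the full or the half width.

```python
import numpy as np


def limb_quadrilateral(joint_a, joint_b, offset=8.0):
    """Corners of the quadrilateral mask built around the segment joint_a -> joint_b."""
    a = np.asarray(joint_a, dtype=float)
    b = np.asarray(joint_b, dtype=float)
    d = b - a
    length = np.linalg.norm(d)
    if length == 0.0:
        raise ValueError("the two joints coincide")
    normal = np.array([-d[1], d[0]]) / length  # unit vector normal to the limb
    off = offset * normal
    return np.array([a + off, b + off, b - off, a - off])


def project_on_normal(point, joint_a, joint_b):
    """Remove the along-limb component of a sampled point around joint_a.

    The result lies on the line through joint_a that is normal to the limb.
    """
    a = np.asarray(joint_a, dtype=float)
    b = np.asarray(joint_b, dtype=float)
    p = np.asarray(point, dtype=float)
    u = (b - a) / np.linalg.norm(b - a)        # unit vector along the limb
    return p - np.dot(p - a, u) * u
```
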
The SURF descriptor, instead, produces a feature vector of 64 floating-point elements for each selected point inside the uncertainty ellipse. Descriptor vectors with values similar to those stored in the database are close in terms of cosine similarity, whereas vectors with different values are far apart. The cosine similarity SC is defined as the cosine of the angle between two vectors, A and B, in $\mathbb{R}^{n}$. The higher the cosine similarity, the higher the probability that the features are similar, because the vectors are more aligned.
\[
S_C(A,B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
\]
where Ai and Bi are the components of vectors A and B, respectively. Matching for each descriptor is performed only for joints that exist both for the user in the frame being compared and for the one in the database. With the constraints imposed for both descriptors, the working domain is 2D.
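A short sketch of the cosine similarity matching follows. The helper `best_match` is hypothetical and only illustrates how the candidate descriptors of the pseudo-random points could be ranked against one reference descriptor from the database.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two descriptor vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0


def best_match(candidates, reference):
    """Index and score of the candidate descriptor most similar to the reference."""
    scores = [cosine_similarity(c, reference) for c in candidates]
    idx = int(np.argmax(scores))
    return idx, scores[idx]
```
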

Monte Carlo Refinement
The selected pseudo-random points for each joint, together with their similarity scores (SH(H1,H2), SC(A,B)) given by the matches of the two descriptors with the database, are used as input for the Sequential Monte Carlo (SMC) method, also known as Particle Filter [34]. The idea behind SMC refinement for both descriptors is to iteratively change the joint positions inside the uncertainty ellipses to find those with a higher probability of matching when the subject in the frame being compared is the same as the one in the database.
The algorithms used for each of the two descriptors are reported as pseudocode in Appendix A and explained below with an example. For the HIST and SURF descriptors, in this example, we considered the anatomical region of the right upper arm and the left ankle, respectively. For the comparison, two identical frames were used, one as ground truth for the database and one as the frame to be compared. This ensures an exact match between the two frames, with a maximum score of 1 for both descriptors.
For the HIST descriptor applied to the right upper arm, 10 pseudo-random points are first generated for each of the two joints at the ends of the anatomical area, within the respective joint uncertainty ellipse and filtered in depth (Figure 7(A), green dots). All the pseudo-random points are then projected orthogonally in the direction normal to the vector linking the two joints (Figure 7(A), red dots). By moving to the left and right of each point by a fixed distance, in this case 8 pixels, the right upper arm is simplified into a rectangle, and the histogram within the resulting mask is extracted. For the database, instead, the rectangle obtained from the initial positions of the joint centres of mass is used and compared with the rectangle obtained from each pair of points. Then, at each new step, SMC keeps only the half of the point pairs with the highest matching scores against the database and generates new positions for these points. It moves each point left or right along the direction normal to the linking vector by a distance proportional to the ellipse's minor semi-axis, following a base-10 logarithmic trend: if the score of a point is low, the point moves far from its previous position, whereas if the score is high, it moves only slightly. If a newly generated point ends up outside the uncertainty ellipse, its position is not updated and remains the previous one. Each iteration saves the maximum score obtained from the database comparison until the maximum number of iterations is reached. Figure 7(B) shows the motion of the joint positions after 3 iterations. The maximum match for the HIST descriptor rises from 0.88 to 0.99.
A similar approach is used for the SURF descriptor. In this case, we did not consider two joints at a time but a single joint that can move in both directions. Figure 8 shows an example of 20 pseudo-random points generated for the left ankle within the selected joint uncertainty ellipse. The SURF descriptor scores were obtained by comparing each point with the descriptor associated with the joint centre of mass used for the database. Then, at each new step, new positions are generated from the previous ones by moving them in a random direction by a distance proportional to the ellipse's minor semi-axis, with a logarithmic trend. As seen in Figure 8(B), after 7 iterations the points moved to the areas of highest match, raising the maximum match from an initial 0.94 to 0.99. Around the maximum peak, the matching decreases while still showing local disturbance peaks, which makes it impossible to predict the surface pattern with a polynomial.
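The single-joint refinement loop described for the SURF descriptor can be sketched as follows. This is a simplified illustration, not the pseudocode of Appendix A: the names are hypothetical, the step length is assumed to be `-b * log10(score)` with b the minor semi-axis and the score clamped to [1e-3, 1], the out-of-ellipse rule stated for the HIST case is assumed to apply here too, and `score_fn` stands in for the descriptor match against the database.

```python
import numpy as np


def refine_joint(points, score_fn, mean, cov, n_iter=7, k=1.0, rng=None):
    """Move candidate points inside the uncertainty ellipse to raise the match score."""
    rng = np.random.default_rng() if rng is None else rng
    points = np.array(points, dtype=float)          # copy of the initial candidates
    mean = np.asarray(mean, dtype=float)
    cov_inv = np.linalg.inv(cov)
    minor = float(np.sqrt(np.linalg.eigvalsh(cov)[0]))  # minor semi-axis (1-sigma)
    scores = np.array([score_fn(p) for p in points])
    best = int(np.argmax(scores))
    best_point, best_score = points[best].copy(), float(scores[best])
    for _ in range(n_iter):
        for i in range(len(points)):
            s = float(np.clip(scores[i], 1e-3, 1.0))
            step = -minor * np.log10(s)             # low score -> long move
            angle = rng.uniform(0.0, 2.0 * np.pi)   # random direction
            cand = points[i] + step * np.array([np.cos(angle), np.sin(angle)])
            d = cand - mean
            if float(d @ cov_inv @ d) <= k**2:      # otherwise keep the old position
                points[i] = cand
                scores[i] = score_fn(cand)
        best = int(np.argmax(scores))
        if scores[best] > best_score:               # remember the best match so far
            best_point, best_score = points[best].copy(), float(scores[best])
    return best_point, best_score
```

Because the best score is only ever replaced by a higher one, the refined score can never fall below the best initial match, which reflects the behaviour reported in the example.
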
For both descriptors, such a high score already at the first step is also due to the fact that the same frame was used in this example for the database and the one to compare. For the same reason, the score obtained with SMC refinement is very close to 1.
Once the highest match is found for each anatomical part and each joint for the HIST and SURF descriptors, the final score is obtained by concatenating these results for each person, where SH(H1,H2) is the vector containing all pairs of points with the highest matches obtained with the HIST descriptor, while SC(A,B) contains, for each joint, the point with the highest match obtained with the SURF descriptor; nk and mk are the number of anatomical parts and joints found with the HIST and SURF descriptors, respectively. Finally, the total similarity score obtained for the person ID is the average of the HIST and SURF scores. The descriptor vectors used as ground truth in the database are updated when a user is identified, or when the angle between limbs or the light variation exceeds 30 %. We monitor the light with the L* channel of the L*A*B* colour space. In addition, descriptor vectors for the database are calculated using the positions of the centres of mass of each joint. All databases are also classified based on 3D information about user orientation. The database size selected for each user orientation in our application was set to 10 vectors for both descriptors.
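One plausible reading of the score aggregation is sketched below. The per-descriptor score is assumed to be the mean of the best matches over the nk anatomical parts (HIST) and the mk joints (SURF); only the final averaging of the two descriptor scores is stated explicitly in the text, so the normalization and the function name are assumptions.

```python
from typing import Sequence


def total_score(hist_scores: Sequence[float], surf_scores: Sequence[float]) -> float:
    """Average of the mean HIST score and the mean SURF score (assumed normalization)."""
    if not hist_scores or not surf_scores:
        raise ValueError("both descriptors need at least one matched part or joint")
    s_hist = sum(hist_scores) / len(hist_scores)   # over the n_k anatomical parts
    s_surf = sum(surf_scores) / len(surf_scores)   # over the m_k joints
    return 0.5 * (s_hist + s_surf)
```
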

VALIDATION
We assessed the performance of the proposed RE-ID method by first evaluating the effects of Monte Carlo refinement and then the RE-ID results obtained with the whole method. The whole validation was implemented in MATLAB 2020b.
We carried out the assessment on the most widely used dataset for re-identifying people with RGB-D images, namely the BIWI RGBD-ID dataset [35]. Although it is an RGB-D dataset targeted at long-term people RE-ID, we selected 28 people from the training sets for whom about 300 images each were collected on the same day and in the same scene, so that the people were dressed in the same way. People moved slightly in the "Still set", while in the "Walking set", each person walked and was seen from different viewing angles. In addition, the same number of subjects and the same subject IDs are present in both sets.

Monte Carlo Refinement
To show the performance of the SMC refinement, we tested its effects on both descriptors. We show an example of the scores obtained with each descriptor using the single image in Figure 9(A), taken from a "Still set" of the BIWI RGBD-ID dataset, as the database. As frames for comparison, we used the remaining frames from the "Still set" of the user in Figure 9(A), another "Still set" from a different user in Figure 9(B), and the "Walking set" of the two users, Figure 9(C) and Figure 9(D), respectively. We also tested the effect of SMC refinement by comparing the database user with another user's dataset, to verify that, for a wrong match, the SMC refinement does not increase the final scores in a way that would compromise RE-ID. Figure 10 shows the results of the HIST descriptor. The results are obtained by concatenating the scores from each limb without the torso and face masks, so as to consider only the anatomical areas where SMC refinement acts. Concatenating the final score with the missing areas as well, mainly the torso area, can only make the final solution better and more robust because of the size and location of that area. In particular, Figure 10(A) shows the results obtained by comparing the database image of Figure 9(A) with the "Still set" in which the user is either the same or the wrong one. As shown in Figure 10(A), in the case of the correct user, the SMC refinement improved the final descriptor score in each frame. However, the score is high even without the refinement because the user moved only slightly in this dataset and the lighting conditions can be considered constant. In Figure 10(B), the same database image is compared with the "Walking set". In this case, the contribution of refinement is greater because the user moves in front of the camera and is seen from a different angle. This demonstrates that refinement is possible with this descriptor even if lighting conditions and user kinematics change, and even with blurred frames.
In particular, frame 32 in Figure 9(C) and Figure 9(D) corresponds to the moment when the users start walking from a different viewing angle. In both cases in Figure 10, when the user is compared with the wrong one, refinement increases the final descriptor score, but not in a way that compromises the RE-ID. It is also true that the lower the score, the easier it is to find a better solution by performing SMC refinement. What matters is keeping this increment controlled, as in our case.
In contrast, Figure 11 shows the results obtained with the SURF descriptor by concatenating the results of each joint. This descriptor is strongly influenced by user kinematics. Figure 11(A) and Figure 11(B) refer to a "Still set" with the same user as the database and one with the wrong user, respectively. Because the user moves only slightly, comparing all frames with the single image selected for the database in Figure 9(A) is acceptable. If, on the other hand, we compare the database image with a "Walking set" (Figure 11(C)), the score after the refinement fluctuates even when the user is the same in the database and the dataset, because comparing a single database image against all possible kinematic poses of the user is a wrong assumption. In this case, the only acceptable comparison is for the first few frames, where the user is farther from the camera but his kinematics does not change. To overcome this problem, the proposed method stores multiple database images, considering not only the user's orientation but also the user's kinematics.

Figure 10. Results of the HIST descriptor score for each frame using the "Still set" (A) and "Walking set" (B), in both cases comparing the same users (red asterisks) or the wrong one (blue asterisks) without the SMC refinement or with it (green symbols).

RE-ID ranks
We assessed the performance of the proposed RE-ID method using the Cumulative Matching Characteristics (CMC) metric [36], which is commonly used for evaluating RE-ID algorithms. For every k from 1 to the number of training subjects, the CMC expresses the average person recognition rate computed when the correct person is included in the k best classification scores (rank-k). The most relevant ranks for RE-ID are the first ranks. In particular, the rank-1 matching rate refers to the probability that the probe image and the image that ranks first in similarity in the search gallery belong to the same subject. Therefore, a higher recognition rate at rank-1 means the correct ID of the subject in the highest rank and thus better model performance since the chance of having false positives with incorrect subject ID is low.
We selected the "Still set" of images from BIWI RGBD-ID as the probe and the "Walking set" as the gallery (database). Matching every subject in the probe against the gallery provides a ranking (Figure 12). This procedure is repeated for every subject and then averaged to obtain the final recognition rate.
On the same dataset, we ran the RE-ID OpenVINO demo [37] to compare it with our method in terms of CMC curves. The OpenVINO demo detects pedestrians in the frames through the pedestrian detector network person-detection-retail-0013 and re-identifies them through person-reidentification-retail-0277, a pedestrian RE-ID model for a general scenario. The choice of this pre-trained model for RE-ID is backed by its Market-1501 rank-1 [38] accuracy of 96.2 %. The demo was modified to use the RE-ID network to compare each probe set frame with all the gallery frames.
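The CMC computation can be sketched as below. It is a generic illustration with hypothetical names, assuming a similarity matrix with one row per probe and one column per gallery subject, and that every probe identity appears exactly once in the gallery.

```python
import numpy as np


def cmc_curve(scores, probe_ids, gallery_ids):
    """Cumulative Matching Characteristics from a probe-by-gallery similarity matrix.

    Element k-1 of the result is the fraction of probes whose correct identity
    is among the k highest-scoring gallery entries (rank-k recognition rate).
    """
    scores = np.asarray(scores, dtype=float)
    gallery_ids = np.asarray(gallery_ids)
    n_probe, n_gallery = scores.shape
    hits = np.zeros(n_gallery)
    for i in range(n_probe):
        order = np.argsort(-scores[i], kind="stable")   # best match first
        rank = int(np.nonzero(gallery_ids[order] == probe_ids[i])[0][0])
        hits[rank:] += 1.0                              # counted from this rank onward
    return hits / n_probe
```

The first element of the returned array is the rank-1 recognition rate, and the curve is non-decreasing and reaches 1 at the last rank by construction.
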
The CMC curves obtained by our method and the RE-ID of OpenVINO are shown in Figure 13.
Both curves in Figure 13 show a high recognition rate at rank-1. This is due to the quality of the methods but also to the way the images of the BIWI RGBD-ID dataset were acquired: the camera was fixed in the same position throughout the tests, and the same background was used for the probe and gallery images. However, our method gave a better result than that obtained with OpenVINO: 99.1 % vs 98.4 %. This difference can be expected to grow if the background changes between the probe set and the gallery set, since the descriptor extracted by the OpenVINO neural network for RE-ID uses a whole-body image as input and thus also takes parts of the background into account.
In addition, CMC over-penalizes false positives while ignoring missed RE-IDs. The decision of whether the selected subject is present in the gallery depends on the choice of the similarity threshold used. In our method, this threshold can be set to a high value without missing RE-IDs, thanks to the SMC refinement process of matching descriptors.

CONCLUSIONS
In this paper, an accurate method for the RE-ID of people was presented. Using RGB-D data as a combination of soft biometric traits for similarity matching, we extrapolated a color-based descriptor and a local feature descriptor considering the uncertainty of human joints.
In contrast to the OpenVINO RE-ID approach, our method does not require any effort for the learning steps, and therefore the features extracted do not depend on the set of images on which the training was performed. It is also more stable than a learning-based strategy because it uses direct strategies to extract features.
Thanks to these properties, the method can be used in environments where an incorrect ID is not acceptable because it would increase the risk to operators, such as industrial environments. With our method, the RE-ID rate remains high under this constraint: because of the high similarity values obtained by sequential Monte Carlo refinement in descriptor matching, we can use high similarity thresholds without risking failure to identify the selected subject.
We achieved a 99.1 % recognition rate at rank-1 on the BIWI RGBD-ID public dataset. This result represents a significant improvement over previous approaches in the literature obtained using the same public dataset.