Evaluating chemometric strategies and machine learning approaches for a miniaturized near-infrared spectrometer in plastic waste classification

ABSTRACT
Optimizing the sorting of plastic waste plays a crucial role in improving the recycling process. In this contribution, we report on a comparative study of multiple machine learning and chemometric approaches to categorize a data set derived from the analysis of plastic waste performed with a handheld spectrometer working in the Near-Infrared (NIR) spectral range. Conducting a cost-effective NIR study requires identifying appropriate techniques to improve commodity identification and categorization. Chemometric techniques, such as Principal Component Analysis (PCA) and Partial Least Squares-Discriminant Analysis (PLS-DA), and machine learning techniques such as Support-Vector Machines (SVM), fine tree, bagged tree, and ensemble learning were compared. Various pretreatments were tested on the collected NIR spectra. In particular, Standard Normal Variate (SNV) and Savitzky-Golay derivatives as signal pre-processing tools were compared with feature selection techniques such as multiple Gaussian Curve Fit based on Radial Basis Functions (RBF). Furthermore, results were combined into a single predictor by using a likelihood-based aggregation formula. Predictive performances of the tested models were compared in terms of classification parameters such as Non-Error Rate (NER) and Sensitivity (Sn) with the analysis of the confusion matrices, giving a broad overview and a rational means for the selection of the approach in the analysis of NIR data for plastic waste sorting.

INTRODUCTION
In the perspective of whole-system economic sustainability, the enormous volume of urban plastic waste and the constant increase in human plastic consumption require a high level of waste valorisation. By the numbers, global plastic production reached 367 million tons in 2021, with Europe accounting for 16 % of the total [1]. 9 % of plastic was recycled, 12 % was incinerated, and 79 % ended up in landfills or natural compartments [2]. The recycling of polymer waste has significant environmental advantages owing to the replacement of primary manufacturing, and waste sorting optimization plays a critical role in the development of the recycling process [3], [4]. Recycling is a technique for plastic product end-of-life waste management [5]. Basically, two types of recycling processes can be distinguished: mechanical and chemical processes [3], [6]. In both, sorting is the most critical stage in the recycling process, and this is true regardless of how effective the recycling program is [3], [4]. The use of automated sorting equipment makes the process more efficient [7]. Usually, these devices rely on vibrational spectroscopic techniques [8]-[11], and camera systems for the polymer identification of clear and coloured products [5], [12]. Other techniques are based on ultraviolet (UV) spectroscopy [13], [14], X-ray [15], and hyperspectral imaging [16]-[18]. Over the years, this strategy has increased the purity of the output plastic, achieving a high percentage of recyclates in the production of secondary materials. However, these systems reach their limits with mixed plastics that require additional sorting elsewhere and can affect the quality of the recyclate if not appropriately allocated. A positive cost-benefit analysis is only possible if the separated polymer fractions have a high purity grade and satisfy the market demand for high-quality recyclates. Therefore, post-consumer recycling consists of many essential steps: collection, sorting, cleaning, size reduction and separation, and/or compatibilization to reduce polymer contamination [5]. In this scenario, the prospect of combining a well-established polymer identification technology with a small, portable, low-cost, real-time spectrometer for local and intermittent semi-automatic sorting is highly desirable, accompanied by robust data analysis [19], [20]. In recent years, chemometric analysis of non-destructive spectroscopic data has been widely investigated as an automated method for improving plastic sorting systems [21]-[24]. This improvement has been driven by the need to reduce the environmental impact [25]. Recently, machine learning has attracted considerable attention in plastic waste recognition using spectroscopic techniques [26]-[32]. In this study, we compared machine learning and chemometric techniques for classifying plastic waste data acquired with a portable Near-Infrared (NIR) spectrometer (see Figure 1 for the scheme of the work).
Comparisons were made between chemometric approaches, Principal Component Analysis (PCA) and Partial Least Squares-Discriminant Analysis (PLS-DA), and machine learning techniques, Support-Vector Machines (SVM), Fine Tree, Bagged Tree, and Ensemble Learning. A comparison was also made in terms of pre-processing: traditional techniques, such as Standard Normal Variate (SNV) and Savitzky-Golay derivatives, were examined in contrast to feature reduction techniques, such as multiple Gaussian Curve Fit based on Radial Basis Functions (RBF). The predictive performances of the tested models were compared in terms of classification parameters, such as Non-Error Rate (NER) and Sensitivity (Sn) with the analysis of confusion matrices, providing a comprehensive overview and a rational means of selecting the approach for the analysis of NIR data for plastic waste sorting.

Sample collection
The first batch of plastic samples was collected in the Selection Division of the Montello SpA recovery and recycling plant (Bergamo, Italy), which accepts post-consumer plastic in the form of municipal waste for recycling [20]. Subsequently, the dataset was expanded to include new samples from municipal waste collected before ending up in landfills. A total of 325 samples from a variety of polymer classes were used in this study. Specifically, the products studied were: 75 samples of poly(ethylene terephthalate) (PET), 100 samples of polyethylene (PE), 75 samples of polypropylene (PP), and 75 samples of poly(styrene) (PS). The assortment included bottles, containers, and packaging of various sizes, shapes, and colours.

NIR analysis
Plastic samples were analysed using the MicroNIR On-site spectrometer (Viavi Solutions Inc., CA, United States) in reflectance mode without pre-treatment of the samples. The instrument is a palm-sized, portable spectrometer weighing approximately 250 g and measuring less than 200 mm in length and 50 mm in diameter. The instrument is equipped with a Linear Variable Filter (LVF), coupled to a linear detector array, which operates in the wavelength range 950-1650 nm. Control settings for spectral data acquisition were set to 10 milliseconds integration time and 50 scans, resulting in a short measurement time of 0.25 seconds. A point-and-shoot technique was used to perform 5 replicates for each sample to reduce the effects caused by sample non-uniformity. A total of 1625 spectra were acquired, and acquisition was performed using MicroNIR™ Pro v3.0 software (Viavi Solutions Inc., CA, United States).

Spectral pre-processing and chemometrics
Pre-processing NIR spectral data has become an essential aspect of chemometric modelling. The goal is to eliminate physical effects from the spectra to improve the subsequent multivariate regression, classification model, or exploratory analysis [33]. In this study, the spectra were collected in a single matrix of 1625 × 125 (samples × wavelengths), and pre-processing was applied using the Savitzky-Golay second derivative method with seven data points and a second-order polynomial, followed by Standard Normal Variate (SNV). The second derivative was applied to correct the drift effect [34], [35] in the NIR spectra, while SNV corrects the baseline shift [36]. SNV was calculated as follows [36]:

$$x_{corr} = \frac{x_{org} - a_0}{a_1} \qquad (1)$$

where $x_{corr}$ is the corrected spectrum, $x_{org}$ is the raw spectrum collected by the instrument, $a_0$ is the mean of the spectrum to be corrected, and $a_1$ is its standard deviation.
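As an illustration of the pre-processing chain described above, the following minimal Python sketch applies a seven-point, second-order Savitzky-Golay second derivative followed by SNV to a single spectrum. It is not the MATLAB/PLS-Toolbox implementation used in this study: the function names are hypothetical, the window edges are simply discarded (three points are lost on each side), unit spacing between spectral points is assumed, and the sample standard deviation (n - 1) is assumed for SNV.

```python
from statistics import mean, stdev

# Savitzky-Golay convolution weights for the second derivative with a
# 7-point window and a second-order polynomial (unit point spacing assumed).
SG_D2_W7_P2 = [5 / 42, 0.0, -3 / 42, -4 / 42, -3 / 42, 0.0, 5 / 42]


def savgol_second_derivative(spectrum):
    """Second derivative at interior points only (3 points dropped per edge)."""
    half = len(SG_D2_W7_P2) // 2
    return [
        sum(w * spectrum[i + k - half] for k, w in enumerate(SG_D2_W7_P2))
        for i in range(half, len(spectrum) - half)
    ]


def snv(spectrum):
    """Standard Normal Variate: centre and scale one spectrum by its own statistics."""
    a0 = mean(spectrum)
    a1 = stdev(spectrum)  # assumption: sample standard deviation (n - 1)
    return [(x - a0) / a1 for x in spectrum]


def preprocess(spectrum):
    """Order used in the text: second derivative first, then SNV."""
    return snv(savgol_second_derivative(spectrum))
```

With these choices, a 125-point spectrum yields 119 pre-processed values; an implementation that extrapolates the window at the edges would keep all 125.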
In addition, normalization was performed by mean centering. Different chemometric methods were used for the correct evaluation of the data of all analysed samples. PCA was initially applied as an exploratory analysis to investigate the data structure and was performed on 1625 NIR spectra from all polymer classes. Then, PLS-DA was applied as a supervised pattern recognition tool to separate the different commodities. Prior to using PLS-DA, data were split into a training set and a test set using a MATLAB proprietary function. The process was repeated 500 times, generating a different training and test set each time (75 % of the samples belonged to the training set and 25 % to the test set). All chemometric analyses were performed with MATLAB 2021b (The MathWorks, Inc, Natick, MA, USA) using the PLS-Toolbox (Eigenvector Research, Inc. Manson, Washington, USA).
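The repeated random partition into training and test sets can be sketched as follows. This is a minimal Python illustration, not the MATLAB function used in the study: the function name, the seed, and the rounding of the training-set size are assumptions, and the sketch draws spectra uniformly at random without stratifying by polymer class or grouping the replicates of a sample.

```python
import random


def repeated_splits(n_items, n_repeats, train_fraction=0.75, seed=0):
    """Yield (train_indices, test_indices) for each random repetition."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    n_train = round(train_fraction * n_items)
    for _ in range(n_repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)
        # The first n_train shuffled indices form the training set, the rest the test set.
        yield sorted(indices[:n_train]), sorted(indices[n_train:])
```

For example, `repeated_splits(1625, 500)` would generate 500 partitions of the 1625 spectra, each with 1219 training and 406 test spectra under the rounding assumed here.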

Machine learning and pre-processing
Various machine learning algorithms were applied for classification purposes: SVM, Fine Tree, Ensemble Learning, and Bagged Tree. In addition, a likelihood-based aggregation procedure (here called Combo) was used to integrate the results into a single predictor, and the same procedure was applied with a Monte Carlo Method (MCM) that perturbs the raw data in order to improve the generalization performance. The chosen hyperparameters were as follows: for Fine Tree, Gini's diversity index (gdi) was used as the split criterion with a maximum of 100 splits; SVM was run with a linear kernel function and a kernel scale of 3; lastly, Ensemble Learning was performed with the Bagged Tree method with 30 learning cycles. To test the reliability of the system, 200 random extractions were performed to split the data into training and test sets. Again, 75 % of the samples were used for training and the rest for testing. Machine learning methods were applied to three different datasets: the raw data collected as specified in the previous paragraph (2.2), data reduced using the Gaussian RBF curve fit [37], and a dataset obtained by combining raw and pre-processed data.
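The aggregation formula itself is not reproduced in this section. Purely as an illustration of what a likelihood-based combination can look like, the Python sketch below multiplies the class probabilities returned by several classifiers (equivalently, sums their logarithms) and renormalises the result. The product rule, the function name, and the small floor value are assumptions and may differ from the Combo procedure actually used.

```python
import math


def aggregate(prob_rows, floor=1e-12):
    """Combine per-classifier class probabilities into one probability vector.

    prob_rows holds one list of class probabilities per classifier,
    all with the classes in the same order.
    """
    n_classes = len(prob_rows[0])
    # Sum of log-probabilities per class; the floor avoids log(0) (assumed value).
    log_scores = [
        sum(math.log(max(row[k], floor)) for row in prob_rows)
        for k in range(n_classes)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores)
    weights = [math.exp(score - top) for score in log_scores]
    total = sum(weights)
    return [w / total for w in weights]
```

The predicted class would then be the index of the largest combined probability.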
Each curve of the dataset was fitted using a combination of 12 Gaussian functions and a linear interpolation with a second-degree function, thus reducing the dataset dimension to 12 RBF centres and 12 sigma values. The procedure is as follows:
1. The second-order derivative is computed and fed to a peak detection algorithm to obtain the initial guesses of the RBF centres (here the MATLAB function "findpeaks" was used, limited to a maximum of 12 peaks and excluding the first and last 20 samples of the spectrum).
2. A linear regression with a second-degree equation is used to remove the offset and second-order trends.
3. The RBF centres are used as the initial guess for an optimization procedure based on a Sequential Quadratic Programming constrained minimization function [38]. The cost function used is reported in (2), where $f_i$ is the frequency of the $i$-th sample, $y_i$ is its raw value, and $c_j$, $\sigma_j$, and $A_j$ are the centre, sigma and amplitude of the $j$-th RBF function, respectively.
4. The centres and sigmas found are collected as features of the new dataset.
The positivity condition posed in (2) allows the number of RBF functions actually used to be reduced dynamically, while the interpolation removes trends that could hide peaks.
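The exact form of the cost function (2) is not shown in this section. The Python sketch below therefore assumes an ordinary sum of squared residuals between the detrended spectrum and a sum of Gaussians, with the positivity condition read as non-negative amplitudes. The function names and the Gaussian parameterisation (a factor of 2 in the denominator) are assumptions, and the constrained SQP minimisation itself is not implemented here.

```python
import math


def rbf_sum(f, centres, sigmas, amplitudes):
    """Value of the assumed model at frequency f: a sum of Gaussian RBFs."""
    return sum(
        a * math.exp(-((f - c) ** 2) / (2.0 * s ** 2))
        for c, s, a in zip(centres, sigmas, amplitudes)
    )


def sse_cost(freqs, values, centres, sigmas, amplitudes):
    """Assumed cost: sum of squared residuals over all spectral points.

    In a constrained fit the amplitudes would be kept >= 0, so an RBF whose
    amplitude is driven to zero effectively drops out of the model.
    """
    return sum(
        (y - rbf_sum(f, centres, sigmas, amplitudes)) ** 2
        for f, y in zip(freqs, values)
    )


def rbf_features(centres, sigmas):
    """Feature vector of the reduced dataset: centres followed by sigmas."""
    return list(centres) + list(sigmas)
```

With 12 RBFs, `rbf_features` returns the 24 values (12 centres and 12 sigmas) that replace the 125 spectral points of each spectrum.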
A third dataset combining the two previous datasets (raw and RBF Gaussian fit) was also created by simply joining the two tables.
All calculations were performed using MATLAB and Statistics Toolbox release 2021b (The MathWorks, Inc, Natick, MA, USA). Automation of the procedure was implemented using MATLAB functions created in-house.
Figure 2 reports the data analysis approach, starting from the raw data, for both chemometric and machine learning modelling.

NIR spectra
The main advantage of NIR spectroscopy is that it is a fast-response analytical technique capable of collecting spectra without prior processing and predicting physical and chemical properties from a single spectrum [39]. The absorption bands in the NIR region are caused by overtones and/or combination bands of primarily carbon-hydrogen and oxygen-hydrogen vibrations. Correct band assignment is difficult since a band may be caused by various combinations of fundamental vibrations. Also, overtone vibrations are highly overlapping [40]. Representative NIR reflectance spectra of the four polymers (PE, PET, PP, and PS) are shown in Figure 3. The main absorbance band for PET was found at 1660 nm, which is related to the 1st overtone of C-H stretching [41], with two other peaks at about 1130 nm and 1415 nm. For PE, the peak around 1211 nm is related to the 2nd overtone of the methylene C-H group, while the peak at about 1217 nm is related to the C-H stretch [42]. Peaks at 1391 nm and 1168 nm correspond, respectively, to the C-H combination band and the 2nd overtone of the CH2 symmetric stretch. Regarding PP, the 2nd overtone of the asymmetric methyl C-H stretch is around 1193 nm, while the asymmetric methylene C-H stretch occurs at about 1211 nm [43]. The two peaks at 1391 nm and 1397 nm are related to the methyl and methylene (C-H) combination. Lastly, for PS, the peak at 1205 nm corresponds to the 2nd overtone of the aromatic C-H stretch, the stretching vibrational mode of C-H occurs around 1639 nm, and the 1st overtone of the aromatic C-H stretch overlaps with the C-H combination band, which occurs at about 1391 nm [42]. To allow comparison between the raw spectra and the same spectra after applying the Savitzky-Golay 2nd derivative and SNV, Figure 4 shows the representative spectra of the four commodities after pre-processing.

Principal component analysis
The PCA calculation was performed after the pre-processing described above for the entire spectral range. For data structure analysis, PCA is a useful chemometric method. The goal of PCA is to extract the information stored in many variables into a smaller number of variables, called Principal Components [44]. Figure 5 shows the score plot of the first two components (73.88 % of the total explained variability), in which a clear separation between the polymer classes can be seen. Along PC1 PET is distinguished from the other commodities. PET samples show very negative score values, while the other samples show positive score values. On the other hand, along PC2, PS is clearly separated from the other plastics.
A clear separation between PP and PE can be seen in the score plot of PC1 vs PC3 in Figure 6, where PC3 accounts for 15.83 % of the total information and explains the difference between PP and the other classes of polymers.

Partial least squares discriminant analysis
Following the exploratory PCA analysis, a supervised classification technique was used to distinguish the different plastic groups. In PLS-DA, a classification objective is added to the PLS regression technique. The response variable is categorical and reflects the class to which the statistical units belong. PLS-DA returns the prediction as a vector with values between 0 and 1 and a length equal to the number of classes in the response variable [45], [46]. Each time PLS-DA was performed, parameters such as NER and sensitivity were calculated in fitting, in cross-validation (CV), and for the test set. The cross-validation procedure was based on the venetian blinds approach with 5 groups. CV was also used to determine the optimal number of Latent Variables (LVs) for each PLS-DA model. Figure 7 shows the sensitivities for each class, calculated for the training set, in CV, and for the test set. The values are close to 1, indicating a very high classification performance. Moreover, the results are well balanced between training, CV, and test set; there is therefore no evidence of overfitting, and the model can be considered reliable and stable. Table 1 shows the NER, defined as the mean class sensitivity [47], calculated for the training set, in cross-validation, and for the test set. Overall, 99 % of the samples were correctly classified for each of the 500 iterations.
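Since the NER is defined here as the mean class sensitivity, it can be computed directly from a confusion matrix. The following Python sketch illustrates the calculation; the function names are hypothetical, rows are assumed to index the true class and columns the predicted class, and every class is assumed to have at least one true sample.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


def sensitivities(matrix):
    """Per-class sensitivity: correctly assigned samples over all true samples of the class."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def non_error_rate(matrix):
    """NER as the arithmetic mean of the class sensitivities."""
    sn = sensitivities(matrix)
    return sum(sn) / len(sn)
```

Because each class contributes equally to the mean, the NER is not dominated by the largest class (here PE, with 100 samples against 75 for each of the others), unlike the overall fraction of correctly classified samples.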

Machine learning
Due to the complexity and the large number of results, the classification parameters for the machine learning analysis are presented only for the test set. Figure 8 shows the NER of the classes for each computed model and for each treatment of the data. The models run on raw data have the worst performances: the NER ranges from 0.74 (Fine Tree) to 0.90 (SVM), indicating a high variability in the results. For raw data, only SVM can be considered a satisfactory model for pattern recognition. Lower variability is observed for pre-treated data and for the combination of pre-treated and raw data, where the NER ranges from 0.96 to 0.99 and from 0.96 to 0.98, respectively. Thus, there is no appreciable difference in the results between pre-processed data and the combination of raw and pre-treated data. These results indicate that feature reduction based on the Gaussian curve fit with RBF gives high performances for pattern recognition in machine learning analysis.
In general, the model performance is comparable between machine learning and multivariate analysis methods. After random extraction of training and test data repeated 500 and 200 times for chemometrics and machine learning, respectively, the NER calculated for the test set is above 0.95 for both methods. However, the use of chemometrics reduces the computational time, compared to the computationally intensive machine learning algorithms.

CONCLUSION
This paper presented a side-by-side comparison between conventional chemometric methods and machine learning algorithms for the classification of a dataset obtained from the study of plastic waste with a portable Near-Infrared (NIR) spectrometer. Multivariate methods such as Principal Component Analysis (PCA) and Partial Least Squares-Discriminant Analysis (PLS-DA) were investigated, as well as machine learning methods such as Support Vector Machines (SVM), Fine Tree, Bagged Tree and Ensemble Learning. Results were also compared in terms of data processing: signal pre-processing tools, namely SNV and Savitzky-Golay derivatives, were compared with feature reduction approaches such as Multiple Gaussian Curve Fit based on Radial Basis Functions (RBF). In addition, the machine learning algorithms were run on raw data, pre-processed data, and the combination of the two. The results from PLS-DA showed very high performances for pattern recognition; in fact, the NER values for the training set, in CV, and for the test set are all equal to 0.99. In contrast, for machine learning, the NER for raw data ranges from 0.74 for Fine Tree to 0.90 for SVM, indicating high variability in the results. The results for the pre-processed data show lower variability, with NER values ranging from 0.96 to 0.99, which also holds for the combination of raw and pre-processed data. This confirms that RBF-based variable reduction is the most crucial point to improve classification performances. Contrary to some results found in the literature, in which the pre-treatment of data had a negative effect on accuracy using chemometrics [48], the pre-treatment of data generally improves the detection accuracy of machine learning techniques. We can conclude that the multivariate and machine learning approaches produce comparable results in terms of model performance. The NER estimated for the test set is above 0.95 for both chemometrics and machine learning after randomly extracting the training and test data 500 and 200 times, respectively. On the other hand, chemometrics is characterised by a lower computation time compared to machine learning algorithms, and it can therefore be considered more advantageous.