A Principal Component Analysis to detect cancer cell line aggressiveness

In this paper, we propose the use of Principal Component Analysis (PCA) as a new post-processing method for the detection of breast and bone cancer cell lines cultured in vitro using a microwave biosensor. MDA-MB-231 and MCF7 breast cancer cell lines and SaOS-2 and 143B osteosarcoma cell lines were characterized using a circular patch resonator in the 1 MHz – 3 GHz frequency range. The return loss of each cancer cell line was analyzed, and the differences among the cell lines were determined through PCA according to a protocol previously proposed mainly for electrocardiogram processing and X-ray photoelectron spectroscopy. Our results showed that the four cancer cell lines analyzed exhibited distinct dielectric properties when compared to each other and to the growth medium, confirming that PCA could be employed as an alternative methodology to analyze the microwave characterization of cancer cell lines which, in turn, may be further exploited as a tool for the detection of cancer cells in healthy tissues.


INTRODUCTION
One defining feature of malignant tumors is the rapid creation of abnormal cells that grow beyond their usual boundaries. Moreover, these cells reproduce and divide uncontrollably until they form cancerous tissues, since they do not respond to the standard signaling system of the body [1]-[3]. As stated by the World Health Organization, in 2020 cancer was a leading cause of death worldwide, accounting for approximately 10 million deaths. By way of example, breast cancer accounted for 2.26 million cases and 685 thousand deaths, and lung cancer for 2.21 million cases and 1.8 million deaths, without forgetting the hundreds of thousands of children who develop malignant tumors each year [4].
Early diagnosis and screening can therefore contribute to reducing mortality and aid in more effective treatment. However, there is a lack of research on cancer behavior due to the various and complex molecular pathways involved in the genesis of tumors [5]. In most cases, the tumor grade, established from the characteristics of the cancer cells throughout the growth of the tumor lesions, is used to make the diagnosis. Several cancer screening methods, such as biopsy, Computed Axial Tomography (CAT), or scintigraphy, exist, but they are costly and invasive.
Biosensors in the microwave field may serve as a complementary or replacement method for early-stage noninvasive prognosis of a variety of illnesses, including malignancies. In this context, the measurement of the dielectric properties of biological tissues has brought significant benefits in biomedical and healthcare applications due to its high sensitivity, versatility, and reduced invasiveness [6]-[9]. Indeed, this technology has consolidated its use in various fields. For example, Gugliandolo et al. [10] developed a microwave microstrip resonator to measure water vapor for industrial pipeline applications. Likewise, Majcher et al. [11] investigated the possibility of using a dagger-shaped probe to measure soil moisture in agrifood applications. Finally, D'Alvia et al. [12]-[14] and Cataldo et al. [15] proposed several applications in the cultural heritage field.
On this basis, microwave-based sensors are now gaining more and more interest in the biomedical field. As highlighted in the literature [16], microwave probes offer the possibility of analyzing living tissue properties through a non-invasive measurement of scattering parameters or complex permittivity [17]-[19] and of identifying possible pathological conditions as a variation in the dielectric properties. Concerning cancer cell and tissue characterization, Maenhout et al. [20] evaluated the dielectric properties (dielectric loss, dielectric constant, and conductivity) of a healthy non-tumorigenic cell line, namely MCF-10A, and four breast cancer cell lines (Hs578T, MDA-MB-231, MCF7, and T47D) using an open-ended coaxial probe in the 200 MHz to 13.6 GHz range. Similarly, Zhang et al. [21] proposed a microwave biosensor capable of identifying the grade of colon cancer cell aggressiveness in the 4-12 GHz range. Finally, in previous work [22], we proposed a circular patch resonator for the measurement of cancer cell line aggressiveness (SaOS-2, 143B, MCF7, and MDA-MB-231) through the use of a Lorentzian fit model for the return loss signal processing and a weighted MANOVA (Multivariate Analysis of Variance) to investigate the differences in the three main parameters of interest, namely return loss, resonance frequency, and full width at half maximum (FWHM).
This paper proposes a novel methodology to analyze the return loss of a microwave sensor. It is based on an optimized Savitzky-Golay filter, generally adopted for electrocardiogram processing or X-ray photoelectron spectroscopy [23], [24], and on principal component analysis (PCA), used to extract meaningful information from the data and to present a final classification based on possible similarities between the analyzed materials.

Cell culture and Experimental Procedure
As previously described [22], we tested the dielectric response of two pediatric human osteosarcoma cell lines, SaOS-2 and 143B [25]-[28], and two human breast adenocarcinoma cell lines, MCF7 and MDA-MB-231 [29], [30]. In particular, SaOS-2 and MCF7 are low-aggressive osteoblast-like osteosarcoma and low-aggressive breast cancer cell lines, while 143B and MDA-MB-231 are high-aggressive lung-tropic metastatic osteosarcoma and high-aggressive bone-tropic breast cancer cell lines, respectively. Cells were seeded in a standard 60 mm Petri dish at an average density of 8 × 10^5 cells/plate and placed in an incubator at 37 °C with 5 % CO2 for 24 hours to allow the cells to form a homogeneous confluent monolayer. During the measurements, all cell types were maintained in 1.5 mL of Dulbecco's Modified Eagle Medium (DMEM) culture medium [31], and eight different dishes were prepared for each cell line. Moreover, eight samples of 1.5 mL pure DMEM were prepared as controls.
A circular patch resonator with a radius of 20.00 mm [22] and a SubMiniature version A (SMA) connector placed on the conductive edge was employed to determine the dielectric properties of the cell line samples. The key component of the measuring setup is the low-cost portable vector network analyzer MiniVNA-TINY [32], used for measuring the return loss |S11(f)| in the operating frequency range of 1.9-2.6 GHz. The 700 MHz frequency span was previously selected to maximize the resolution of the acquired data (0.5 MHz) [22]. As a result, the return loss |S11(f)| was acquired for the eight samples of each of the five different "materials under test", i.e., the pure medium and the four cell lines.

Data Elaboration Process
Principal component analysis (PCA) is a multivariate analysis technique that identifies and extracts meaningful information from the data and presents a final classification based on a multiparametric similarity test and variable reduction [33]. Figure 1 shows a scheme of the applied pre-processing algorithm. All data processing was performed with OriginLab 2017 software.
In particular, PCA is a useful tool to reduce the dimension of a dataset, retaining only the components with the highest variance. As a result, all the vectors used to represent the acquired return loss are projected into a new space with a dimension equal to the number of significant components determined by PCA, and the acquired data may be represented as:
$$X_{(i \times j)} = S_{(i \times k)} \, L^{T}_{(k \times j)} + E_{(i \times j)} \qquad (1)$$
where X is the original data matrix containing the return loss data, L is the loading matrix, S is the score matrix based on the eigenvalues derived from the X matrix decomposition, and E is the error matrix, which contains the variance not explained by the PCA model. The matrix dimension i is the number of acquired samples, j is the signal length, and k is the number of significant components.
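The decomposition described above (data equal to scores times transposed loadings plus a residual) can be illustrated with a minimal sketch. It is not the OriginLab procedure used in this work: it assumes NumPy, obtains the scores and loadings from a singular value decomposition, and runs on synthetic random data. The function name `pca_decompose`, the matrix sizes, and the column-centring step are assumptions introduced only for illustration.

```python
import numpy as np

def pca_decompose(X, k):
    """Split X (i samples x j points) into scores S, loadings L and residual E."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    S = U[:, :k] * s[:k]                 # score matrix, shape (i, k)
    L = Vt[:k].T                         # loading matrix, shape (j, k)
    E = X - S @ L.T                      # part of X not explained by k components
    explained = s**2 / np.sum(s**2)      # variance fraction carried by each component
    return S, L, E, explained

# Synthetic stand-in for the data matrix: 40 signals of 200 points each (assumed sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
X = X - X.mean(axis=0)                   # column-centring before PCA (assumption)

S, L, E, explained = pca_decompose(X, k=3)
print(S.shape, L.shape, E.shape)
```

By construction, `S @ L.T + E` reproduces `X`, and the entries of `explained` sum to one, so their running sum gives the cumulative variance.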
Before performing the PCA on the acquired return loss data, we applied a pre-processing algorithm, as proposed by Es Sebar et al. [34] for Raman spectroscopy applications: 1) baseline removal through an interactive endpoint weighted (EPW) algorithm for each column vector of X; 2) application of a Savitzky-Golay filter (SGF) using a window length of 14 points and fitted with a second-order polynomial, since the SGF flattens peaks less than a moving average smoothing with the same window width [35]; 3) data normalization by subtracting its average value from each X column and scaling by the standard deviation [36]. For the i-th column of X, equation (2) holds:
$$X^{*}_{i} = \frac{X_{(i,c)}}{\sigma} \qquad (2)$$
with X* the normalized matrix, X(i,c) the i-th centered vector, and σ the standard deviation of the X(i,c) vector. This normalization is also known as standard normal variate (SNV) transformation.
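The three pre-processing steps listed above can be sketched as follows. This is a hedged illustration assuming NumPy, not the OriginLab implementation: the baseline is approximated by a straight line through the averaged end points, the Savitzky-Golay smoothing is written out as a local least-squares fit, and an odd window of 15 points is used because this simple formulation needs a central point. The function names, the `n_end` parameter, the 1401-point frequency grid, and the synthetic signal are all assumptions.

```python
import numpy as np

def remove_baseline(y, n_end=5):
    """Subtract the straight line joining the means of the first and last n_end points."""
    x = np.arange(y.size, dtype=float)
    x0, x1 = x[:n_end].mean(), x[-n_end:].mean()
    y0, y1 = y[:n_end].mean(), y[-n_end:].mean()
    slope = (y1 - y0) / (x1 - x0)
    return y - (y0 + slope * (x - x0))

def savitzky_golay(y, window=15, order=2):
    """Smooth y with a local polynomial least-squares fit (window must be odd)."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    A = np.vander(offsets, order + 1, increasing=True)  # local polynomial design matrix
    coeffs = np.linalg.pinv(A)[0]            # weights giving the fitted value at offset 0
    padded = np.pad(y, half, mode="edge")    # repeat edge values so the length is kept
    return np.convolve(padded, coeffs[::-1], mode="valid")

def snv(y):
    """Standard normal variate: centre the signal and scale by its standard deviation."""
    return (y - y.mean()) / y.std(ddof=1)

# Synthetic return-loss-like signal: sloped background, one resonance dip, noise
freq = np.linspace(1.9, 2.6, 1401)           # GHz, assumed 0.5 MHz step
rng = np.random.default_rng(0)
raw = (-5.0 - 2.0 * freq
       - 20.0 / (1.0 + ((freq - 2.25) / 0.02) ** 2)
       + rng.normal(0.0, 0.1, freq.size))

processed = snv(savitzky_golay(remove_baseline(raw)))
```

The order of the calls mirrors the order of the steps in the text: baseline removal, filtering, then normalization.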
The PCA was then performed by substituting equation (2) into equation (1):
$$X^{*} = S \, L^{T} + E \qquad (3)$$
The discriminant analysis based on a cross-validation test was performed as the final analysis, using as many k components as those with an eigenvalue greater than or equal to 3 [37]. Figure 2 presents an example of data processing for DMEM, reporting the initial raw data (Figure 2a) and the three signal processing steps (Figure 2 b, c, and d): baseline removal, filtering, and normalization, respectively. In detail, the EPW algorithm translates and rotates the signal so that the tails lie at zero, while the SG filter evaluates a polynomial regression around each point, creating a new smoothed value for each data point. Finally, the SNV transformation centers and scales the data without altering their overall interpretation: indeed, if two variables were strongly correlated before pre-processing, they would still be strongly correlated after it. Therefore, for each of the forty acquired signals, the background is removed, the spectrum is filtered to improve the signal-to-noise ratio, the normalization is completed, and the PCA is performed.

RESULTS AND DISCUSSIONS
The cumulative variance trend is shown in Figure 3. The first three components represent an overall variance of about 92.3 %, given by contributions of 76.6 %, 11.9 %, and 3.8 % for components 1, 2, and 3, respectively. According to the literature [37], this can be considered a satisfactory value, as a balance between the cumulative variance and the complexity of the system in the subsequent analysis, also taking into account that the fourth component contributes only 2.5 % of the total variance, while the remaining thirty-six components account for the remaining 5.2 %.
As can be seen in Figures 4 (a) and (b), the measurements group together into two macro clusters: one containing the pure medium, highlighted by a dotted rectangle, and a second containing all the tested cell types, highlighted by a dashed rectangle. Nonetheless, in both figures, it is also possible to distinguish five sub-clusters, highlighted by the ellipses enclosing similar spectra with a 95 % confidence level. Interesting results can be obtained by focusing on the inclination of these clusters. Indeed, the pure medium revealed a different inclination than those obtained when testing all the cell lines. On the other hand, the two less aggressive cell lines (SaOS-2 and MCF7) share the same inclination, as do the two aggressive cell lines (MDA-MB-231 and 143B). In more detail, the pure medium cluster and the cluster representing the highly aggressive cell lines (143B and MDA-MB-231) have the same inclination (100° and 90°, respectively) both when focusing on PC2 vs. PC1 and PC3 vs. PC1, while the inclination of the cluster representing the low-aggressive cell lines (SaOS-2 and MCF7) is 105° when representing PC2 vs. PC1 and 84° when representing PC3 vs. PC1. The inclination of the 95 % confidence ellipse of each cluster may therefore be a parameter that gives helpful information on tumor aggressiveness.
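One possible way to quantify the inclination of a cluster is to take the orientation of the major axis of the covariance of its scores, which is also the major axis of a confidence ellipse built from that covariance. The sketch below follows that assumption using only the Python standard library. The text does not state how the angles were actually measured, so the function name `cluster_inclination_deg`, the angle convention (degrees from the PC1 axis, in the range 0 to 180), and the sample points are hypothetical.

```python
import math

def cluster_inclination_deg(pc_x, pc_y):
    """Angle (degrees, 0 <= angle < 180) of the major axis of the score covariance."""
    n = len(pc_x)
    mx, my = sum(pc_x) / n, sum(pc_y) / n
    sxx = sum((x - mx) ** 2 for x in pc_x) / (n - 1)
    syy = sum((y - my) ** 2 for y in pc_y) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(pc_x, pc_y)) / (n - 1)
    # Orientation of the principal axis of a 2 x 2 covariance matrix
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.degrees(angle) % 180.0

# Hypothetical scores of one eight-sample cluster in the PC1-PC2 plane
pc1 = [0.0, 0.1, -0.1, 0.2, -0.2, 0.05, -0.05, 0.15]
pc2 = [0.0, 1.0, -1.0, 2.1, -1.9, 0.4, -0.6, 1.4]
print(round(cluster_inclination_deg(pc1, pc2), 1))
```

Because the angle depends only on the covariance of the scores, it is unchanged by the confidence level chosen for the ellipse, which only rescales the axes.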
combination with the main two components to allow for a cumulative variance higher than 90.0 % (as discussed above), thus improving the fitting of the essential peaks found in the two main components.
Subsequently, we evaluated the cross-validation of the PCA loadings for the first three components, and the results are reported in Table 1. This test highlighted that pure DMEM was detected with a prediction accuracy of 100.0 %, SaOS-2 and MCF7 with a prediction accuracy of 87.5 %, and MDA-MB-231 with a prediction accuracy of 75.0 %. Finally, 143B cells have a prediction accuracy of 50.0 %. It is worth noting that these discretized prediction accuracy values are strictly related to the number of tested samples: when testing 8 samples, every prediction accounts for 12.5 % of the accuracy. The final average prediction error for the entire data set is 20.0 %. Moreover, these results are in good agreement with those reported in [22], in which the different cell lines were studied with reference to the main Lorentzian fit parameters (return loss, resonance frequency, and FWHM) through a MANOVA test. In particular, in [22], we reported a statistically significant difference between DMEM and all tested cell lines (p < 0.0001), and in this work, we obtained a cross-validation accuracy of 100.0 % for DMEM. Similarly, MANOVA reported a significant (p < 0.05) difference between 143B and MCF7 and no significant difference between 143B and MDA-MB-231, and PCA prediction accuracy was 12.5 % and 25.0 %, respectively. Therefore, the procedure reported in this work represents an alternative methodology to distinguish tumor aggressiveness without using any fitting procedure, hence based only on the raw data; its limit at present is the small number of measurements for each group.
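The arithmetic behind these figures can be checked with a short worked example. It uses only the accuracies quoted above and the eight samples per group; the variable names are illustrative, and the per-group counts of correct predictions are derived from the percentages rather than read from Table 1.

```python
N_SAMPLES = 8  # samples per material under test

# Prediction accuracies quoted in the text, in percent
accuracy_pct = {
    "DMEM": 100.0,
    "SaOS-2": 87.5,
    "MCF7": 87.5,
    "MDA-MB-231": 75.0,
    "143B": 50.0,
}

# With 8 samples per group, each prediction is worth 100 / 8 percent
step_pct = 100.0 / N_SAMPLES

# Number of correctly predicted samples implied by each accuracy
correct = {name: round(acc / step_pct) for name, acc in accuracy_pct.items()}

mean_accuracy = sum(accuracy_pct.values()) / len(accuracy_pct)
mean_error = 100.0 - mean_accuracy

print(step_pct, correct, mean_error)
```

The implied counts add up to 32 correct predictions out of 40 signals, which matches the 20.0 % average prediction error.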

CONCLUSIONS
This paper proposes an alternative methodology to analyze the return loss of tumor cell lines. The method discriminates between groups of different tumor cells by analyzing the appropriately filtered and normalized acquired signal, leading to results in agreement with those obtained with traditional methods, such as the Lorentzian fit.
This methodology is based on a pre-processing algorithm, consisting of background removal combined with a Savitzky-Golay filter and a normalization with respect to the signal variation, followed by a Principal Component Analysis. The results showed good average accuracy of the prediction methodology, confirming the feasibility of PCA for this kind of signal as well, in addition to its consolidated applications in the processing of more complex, multi-peak signals.
As a future development, we expect to realize a "split ring resonator" sensor inducing more peaks in the instrument's uniformity band to better evaluate the reliability of the methodology proposed here.