Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications
by Murphy, Dean, and Raftery
Classifying or clustering spectra (or applying semi-supervised learning to them) is challenging at every stage, from collecting data that are ready for statistical analysis to reducing the dimensionality without sacrificing the complex information in each spectrum. It is hard both to estimate spiky (non-differentiable) curves with statistically well-defined estimating-equation procedures and to transform the data so that they satisfy the regularity conditions that statistical theory assumes.
Classification and clustering of astrophysical spectroscopic data are harder still, because the observed lines, together with their intensities and FWHMs on top of the continuum, depend on atomic databases and on latent variables or hyperparameters (distance, rotation, absorption, column density, temperature, metallicity, types, system properties, etc.). Separating lines from one another, and lines from the continuum, frequently becomes a very challenging mixture problem with boundary and identifiability issues. This complexity is particular to astronomical spectroscopic data because we only get indirect, uncontrolled data governed by physics, as opposed to the meat species spectra in the paper. Spectroscopic data outside astronomy are rather smooth, are observed over a controlled wavelength range, and need no correction for recession, radial velocity, redshift, extinction, lensing, etc.
Although the part most relevant to astronomers, spectroscopic data processing, is not discussed in this paper, the most important part, the application of statistical learning to complex curves (spectral data), is well described. Astronomers with appropriate data may want to try the variable selection strategy and to look into the classification methods available in statistics. If it works, it might save storage space for spectral data and the time needed to collect high-resolution spectra. Please keep in mind that it is not necessary to use the same variable selection strategy. Astronomers can create better-working versions for classification and clustering purposes. Hardness ratios, for example, are often used to reduce the dimensionality of spectral data, since low-total-count spectra are not informative across the full energy (wavelength) range. This is the curse of dimensionality at work.
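As a rough illustration of how a hardness ratio collapses a binned spectrum to a single number, here is a minimal Python sketch. It is not taken from the paper. It assumes one common convention, HR = (H - S) / (H + S), where S and H are the total counts below and above a split energy. The function name `hardness_ratio`, the 2 keV split, and the toy counts are all hypothetical choices made for the example.

```python
import numpy as np

def hardness_ratio(energies_kev, counts, split_kev=2.0):
    """Return (H - S) / (H + S) for a binned spectrum.

    S sums the counts in bins with energy below split_kev,
    H sums the counts in bins at or above it. The split value
    is an assumption for this sketch, not a standard.
    """
    energies_kev = np.asarray(energies_kev, dtype=float)
    counts = np.asarray(counts, dtype=float)
    soft = counts[energies_kev < split_kev].sum()
    hard = counts[energies_kev >= split_kev].sum()
    total = soft + hard
    if total == 0:
        # No counts at all: the ratio is undefined.
        return float("nan")
    return (hard - soft) / total

# Toy spectrum: five energy bins (keV) and their photon counts.
energies = [0.5, 1.0, 1.5, 3.0, 5.0]
counts = [4, 3, 1, 1, 1]
hr = hardness_ratio(energies, counts)  # soft = 8, hard = 2
```

Here five bins reduce to one feature, which is the kind of dimensionality reduction described above. With very low counts the simple ratio becomes noisy, so a real analysis would also need to propagate the Poisson uncertainty.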