Interpreting experimental Raman spectra of amino acid mixtures via a variational autoencoder-based machine learning approach
Abstract
Amino acid detection holds significant practical value, particularly in biochemical and medical applications. Tryptophan and tyrosine are precursors of serotonin and dopamine, respectively, and abnormalities in their metabolism are closely associated with neurodegenerative diseases such as Alzheimer's disease. This study aims to develop a rapid computational method for identifying the main components of a given amino acid mixture based on Raman spectroscopy and machine learning models. We constructed theoretical Raman spectra of 20 amino acids using density functional theory (DFT) and subsequently generated theoretical spectra of arbitrary mixtures. Two machine learning classifiers, Random Forest (RF) and XGBoost, predicted the dominant amino acid in these random theoretical mixtures with an accuracy higher than 94.7%. For the task of predicting the principal components of experimental mixtures, prepared by mixing phenylalanine (Phe) and glutamic acid (Glu) in arbitrary ratios, the experimental spectra of the Glu-Phe mixtures were transformed sequentially by asymmetrically reweighted penalized least squares (arPLS) fitting and a variational autoencoder (VAE) into the style of DFT spectra. The RF and XGBoost models then predicted the leading amino acid in these transformed mixtures with 100% accuracy. We demonstrate that this workflow effectively reduces the discrepancy between theoretical and experimental Raman spectra and substantially improves the practical applicability of the approach in biomedical settings.
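As an illustration of the mixture-generation step, the following is a minimal sketch, assuming that a theoretical mixture spectrum is a fraction-weighted linear sum of pure-component spectra and that the training label is the component with the largest fraction. The abstract does not specify the mixing rule, so this is an assumption; the function names, the Lorentzian line shape, the component names "A" and "B", and all peak positions are hypothetical placeholders rather than DFT results.

```python
import math
import random


def lorentzian_spectrum(peaks, grid, width=8.0):
    """Build a spectrum as a sum of Lorentzian lines.

    peaks: list of (center, intensity) pairs; grid: wavenumbers in cm^-1.
    """
    return [
        sum(a * width ** 2 / ((x - c) ** 2 + width ** 2) for c, a in peaks)
        for x in grid
    ]


def random_mixture(pure_spectra, rng):
    """Mix pure spectra with random fractions (assumed linear superposition).

    Returns the mixture spectrum, the fractions, and the dominant component.
    """
    names = list(pure_spectra)
    weights = [rng.random() for _ in names]
    total = sum(weights)
    fractions = {n: w / total for n, w in zip(names, weights)}
    n_points = len(pure_spectra[names[0]])
    mixture = [
        sum(fractions[n] * pure_spectra[n][i] for n in names)
        for i in range(n_points)
    ]
    dominant = max(fractions, key=fractions.get)
    return mixture, fractions, dominant


# Hypothetical wavenumber grid and placeholder peak lists (not DFT output).
grid = [400 + 2 * i for i in range(700)]
toy_peaks = {
    "A": [(1000.0, 1.0), (1030.0, 0.3)],
    "B": [(870.0, 1.0), (1410.0, 0.5)],
}
pure = {name: lorentzian_spectrum(p, grid) for name, p in toy_peaks.items()}

rng = random.Random(0)  # fixed seed so the example is reproducible
mixture, fractions, dominant = random_mixture(pure, rng)
```

Pairs of `mixture` and `dominant` generated this way could serve as feature vectors and labels for a classifier such as RF or XGBoost; experimental spectra would first need baseline correction and style transfer, as described above.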








