Performance of genetic algorithm-optimized Doc2Vec-kNN for classifying documents in space science and adjacent fields with heterogeneous sampling
Abstract
We assessed the performance of k-nearest neighbor (kNN) classification on documents drawn from heterogeneously sampled classes. The input data were publicly available titles, abstracts, and fields of study of articles in space science and adjacent fields. A genetic algorithm (GA) was used for hyperparameter tuning, while Doc2Vec was used to transform text into vectors. The GA-optimized Doc2Vec-kNN algorithm performed very well, correctly predicting >92% of the test data on average across all classes. The confusion matrix supported this finding. However, some classes performed below 80% due to lower recall and F1-scores.
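To illustrate how a high average score can coexist with weaker individual classes under heterogeneous sampling, the following minimal sketch derives overall accuracy and per-class precision, recall, and F1-score from a confusion matrix. The matrix `toy`, the function names, and the three-class setup are hypothetical and chosen only for illustration; they are not data or code from this study.

```python
# Illustrative sketch only: the confusion matrix below is made-up toy data,
# not a result from the study.

def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def per_class_metrics(confusion):
    """Return {class_index: (precision, recall, f1)}.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    metrics = {}
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # missed members of class k
        fp = sum(confusion[i][k] for i in range(n)) - tp  # wrongly assigned to class k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[k] = (precision, recall, f1)
    return metrics


# Hypothetical imbalanced case: classes 0 and 1 have 100 samples each,
# class 2 has only 20.
toy = [
    [95, 3, 2],
    [4, 94, 2],
    [3, 3, 14],
]

overall = accuracy(toy)
metrics = per_class_metrics(toy)

print(f"overall accuracy: {overall:.3f}")
for k, (p, r, f) in metrics.items():
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case the overall accuracy is about 0.92, while the small class reaches a recall of only 0.70, which is the kind of gap between aggregate and per-class performance that the confusion matrix makes visible.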