Performance of genetic algorithm optimized Doc2Vec-kNN for classifying space science and adjacent fields documents with heterogenous sampling

Authors

  • Dominic P. Guaña, Philippine Space Agency
  • Arcy Layne L. Sace, Philippine Space Agency
  • Paul Leonard Atchong C. Hilario, Philippine Space Agency
  • Efren G. Gumayan, Iloilo Science and Technology University
  • Gay Jane P. Perez, Philippine Space Agency and Institute of Environmental Science and Meteorology, University of the Philippines Diliman

Abstract

We assessed the performance of k-nearest neighbor (kNN) classification on a document set with heterogeneous class samples. The input data consisted of publicly available titles, abstracts, and fields of study of articles in space science and adjacent fields. A genetic algorithm (GA) was used for hyperparameter tuning, while Doc2Vec was used to transform text into vectors. Results showed that the GA-optimized Doc2Vec-kNN algorithm performed well, correctly predicting more than 92% of the test data on average across all classes. The confusion matrix also supported this finding. However, some classes scored below 80% because of lower recall and F1-scores.
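The gap between a high average score and weaker individual classes can be illustrated with a minimal sketch that derives per-class precision, recall, and F1 from a confusion matrix. The function name `per_class_metrics` and the three-class counts below are hypothetical and chosen only for illustration; they are not the authors' code or data.

```python
def per_class_metrics(confusion):
    """Return (overall accuracy, per-class metrics) for a square confusion matrix.

    confusion[i][j] is the number of documents whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    metrics = []
    for i in range(n):
        tp = confusion[i][i]                              # true positives
        fn = sum(confusion[i]) - tp                       # class i predicted as another class
        fp = sum(confusion[r][i] for r in range(n)) - tp  # other classes predicted as i
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})

    return correct / total, metrics


# Hypothetical counts (NOT from the paper): two large classes and one
# small class, mimicking a data set with unequal class sample sizes.
confusion = [
    [95, 3, 2],
    [4, 94, 2],
    [3, 2, 5],
]

accuracy, metrics = per_class_metrics(confusion)
print(f"overall accuracy: {accuracy:.3f}")
for label, m in enumerate(metrics):
    print(f"class {label}: precision={m['precision']:.3f} "
          f"recall={m['recall']:.3f} f1={m['f1']:.3f}")
```

In this toy matrix the two large classes dominate the overall accuracy, while the small class recovers only half of its documents. This is the kind of pattern in which an aggregate score stays high even though recall and F1 for an under-sampled class are low.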

Article ID

SPP-2023-PB-04

Section

Poster Session B (Complex Systems, Simulations, and Theoretical Physics)

Published

2023-07-03

How to Cite

[1] DP Guaña, ALL Sace, PLAC Hilario, EG Gumayan, and GJP Perez, Performance of genetic algorithm optimized Doc2Vec-kNN for classifying space science and adjacent fields documents with heterogenous sampling, Proceedings of the Samahang Pisika ng Pilipinas 41, SPP-2023-PB-04 (2023). URL: https://proceedings.spp-online.org/article/view/SPP-2023-PB-04.