Identification of actionable tweets during Philippine disasters using machine learning
Abstract
Successful disaster response requires time-critical information, and social media platforms such as Twitter are an abundant source of it. However, the sheer volume of tweets and their short, unstructured nature make it challenging and time-consuming to find those that contain actionable information. In this study, we developed binary logistic regression models that classify tweets by their informativeness, i.e., whether they provide information about the on-the-ground situation during a disaster. We trained and tested our models on tweets from the CrisisLexT26 dataset concerning five disasters that occurred in the Philippines in 2012 and 2013. We compared models using different feature sets extracted with the Bag-of-Words (BoW) and TF-IDF vectorization methods, in conjunction with word embeddings from word2vec, and evaluated their accuracy, precision, recall, AUC, and F1-score. Our results indicate that relatively simple models using BoW and TF-IDF features achieve good performance, comparable to that of more sophisticated models built for similar applications. We attribute this to the difference in scope, which points to the potential of a country-specific training approach. This is supported by our identification of the keywords most important in predicting the informativeness of a tweet. We found that words and hashtags related to calls for donation, impacts of disasters, and specific events were most associated with informative tweets, while those related to sympathy, such as 'pray' and '#prayforvisayas', were most associated with non-informative tweets. These models show promise in automating the identification of useful and relevant tweets to support disaster response efforts.
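As an illustration of the kind of pipeline summarized above, the following is a minimal sketch of TF-IDF features feeding a binary logistic regression classifier. It is not the implementation used in this study. The example tweets are invented for illustration and are not drawn from CrisisLexT26; the function names (tokenize, fit_idf, tfidf, train, predict_proba), the smoothed IDF formula, the plain stochastic gradient descent training loop, and its hyperparameters are all assumptions. The word2vec features are omitted.

```python
import math
from collections import Counter

# Invented toy tweets (NOT from CrisisLexT26).
# Label 1 = informative, 0 = non-informative.
TRAIN = [
    ("donate relief goods for flood victims in cebu", 1),
    ("bridge collapsed after the earthquake in bohol", 1),
    ("evacuation center needs water and food donations", 1),
    ("roads flooded and power out in tacloban", 1),
    ("pray for the philippines #prayforvisayas", 0),
    ("our hearts and prayers go out to everyone", 0),
    ("stay strong philippines we pray for you", 0),
    ("so sad about the news sending prayers", 0),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for each token."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf(text, idf):
    """Return an L2-normalized sparse TF-IDF vector as a dict."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, idf, epochs=200, lr=0.5):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = {tok: 0.0 for tok in idf}
    bias = 0.0
    feats = [(tfidf(text, idf), y) for text, y in examples]
    for _ in range(epochs):
        for x, y in feats:
            p = sigmoid(bias + sum(weights[t] * v for t, v in x.items()))
            err = p - y  # gradient of the log loss w.r.t. the logit
            bias -= lr * err
            for t, v in x.items():
                weights[t] -= lr * err * v
    return weights, bias


def predict_proba(text, weights, bias, idf):
    """Estimated probability that a tweet is informative."""
    x = tfidf(text, idf)
    return sigmoid(bias + sum(weights[t] * v for t, v in x.items()))


idf = fit_idf([text for text, _ in TRAIN])
weights, bias = train(TRAIN, idf)

# Tokens with the largest positive and negative weights, analogous to the
# keyword-importance analysis described in the abstract.
top_informative = sorted(weights, key=weights.get, reverse=True)[:5]
top_non_informative = sorted(weights, key=weights.get)[:5]
```

In this sketch, the sign and magnitude of each learned weight indicate how strongly a token pushes a tweet toward the informative or non-informative class, which is one simple way to surface influential keywords. A Bag-of-Words variant would replace the TF-IDF weighting in tfidf with raw token counts.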