Title | The Effect of Preprocessing on Short Document Clustering |
Authors | Koopman, Cynthia and Wilhelm, Adalbert |
Year | 2020 |
Volume | Archives of Data Science, Series A 6(1) / 2020 |
Abstract | Natural Language Processing has become a common tool to extract relevant information from unstructured data. Messages in social media, customer reviews, and military messages are all very short and therefore harder to handle than longer texts. Document clustering is essential in gaining insight from these unlabeled texts and is typically performed after some preprocessing steps. Preprocessing often removes words, which becomes risky in short texts, where the main message consists of only a few words. This paper therefore analyzes the effect of preprocessing and feature extraction on such short documents. Six different levels of text normalization are combined with four different feature extraction methods. These settings are all applied to K-means clustering and tested on three different datasets. The anticipated results could not be confirmed; however, other findings are insightful regarding the connection between text cleaning and feature extraction. |
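
The experimental design described in the abstract (normalization levels crossed with feature extraction methods, each fed to K-means) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the three normalization levels, the two feature extraction methods, the stop-word list, the toy documents, and the function names `normalize`, `vectorize`, and `kmeans` are all assumptions made here, since the abstract does not name the six levels or four methods used in the paper.

```python
import math
import re
from collections import Counter
from itertools import product

# Hypothetical stop-word list; the paper's actual list is not given in the abstract.
STOP_WORDS = {"the", "is", "a", "at", "to", "and", "of"}

def normalize(text, level):
    """Assumed levels: 0 = whitespace split only, 1 = lowercase and strip
    punctuation, 2 = level 1 plus stop-word removal."""
    if level == 0:
        return text.split()
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if level >= 2:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

def vectorize(token_lists, method):
    """'count' gives raw term counts; 'tfidf' scales counts by log(N / df)."""
    vocab = sorted({t for toks in token_lists for t in toks})
    n_docs = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        if method == "count":
            vectors.append([float(counts[t]) for t in vocab])
        else:
            vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vectors

def kmeans(vectors, k, n_iter=10):
    """Plain Lloyd's algorithm; the first k vectors serve as initial
    centroids so that the sketch is deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up short documents, for illustration only.
docs = [
    "The delivery is late!",
    "Late delivery, again.",
    "Great phone and a great camera",
    "The camera of the phone is great",
]

# Run every (normalization level, feature method) combination through K-means.
results = {}
for level, method in product(range(3), ["count", "tfidf"]):
    tokens = [normalize(d, level) for d in docs]
    results[(level, method)] = kmeans(vectorize(tokens, method), k=2)
```

Each entry of `results` holds the cluster labels for one preprocessing setting, so settings can be compared against each other or against known labels. Note how level 2 shrinks already short documents further, which is the risk the abstract points to.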