Robust Text Mining in Online Social Network Context

WANG Ye-thesis_nosignature.pdf - Submitted Version (2MB) | Preview

Wang, Ye (2018) Robust Text Mining in Online Social Network Context. PhD thesis, Victoria University.


Text mining is involved in a broad scope of applications in diverse domains that mainly, but not exclusively, serve political, commercial, medical and academic needs. Along with the rapid development of the Internet technology in recent thirty years and the advent of online social media and network in a decade, text data is obliged to entail features of online social data streams, for example, the explosive growth, the constantly changing content and the huge volume. As a result, text mining is no longer merely oriented to textual content itself, but requires consideration of surroundings and combining theories and techniques of stream processing and social network analysis, which give birth to a wide range of applications used for understanding thoughts spread over the world , such as sentiment analysis, mass surveillance and market prediction. Automatically discovering sequences of words that represent appropriate themes in a collection of documents, topic detection closely associated with document clustering and classification. These two tasks play integral roles in revealing deep insight into the text content in the whole text mining framework. However, most existing detection techniques cannot adapt to the dynamic social context. This shows bottlenecks of detecting performance and deficiencies of topic models. In this thesis, we take aim at text data stream, investigating novel techniques and solutions for robust text mining to tackle arising challenges associated with the online social context by incorporating methodologies of stream processing, topic detection and document clustering and classification. In particular, we have advanced the state-of-theart by making the following contributions: 1. A Multi-Window based Ensemble Learning (MWEL) framework is proposed for imbalanced streaming data that comprehensively improves the classification performance. MWEL ensures that the ensemble classifier is maintained up to date and adaptive to the evolving data distribution by applying a multi-window monitoring mechanism and efficient updating strategy. 2. A semi-supervised learning method is proposed to detect latent topics from news streams and the corresponding social context with a constraint propagation scheme to adequately exploit the hidden geometrical structure as supervised information in given data space. A collective learning algorithm is proposed to integrate the textual content into the social context. A locally weighted scheme is afterwards proposed to seek an improvement of the algorithm stability. 3. A Robust Hierarchical Ensemble (RHE) framework is introduced to enhance the robustness of the topic model. It, on the one hand, reduces repercussions caused by outliers and noises, and on the other overcomes inherent defects of text data. RHE adapts to the changing distribution of text stream by constructing a flexible document hierarchy which can be dynamically adjusted. A discussion of how to extract the most valuable social context is conducted with experiments for the purpose of removing some noises from the surroundings and efficiency of the proposed.

Item type Thesis (PhD thesis)
Subjects Historical > FOR Classification > 0801 Artificial Intelligence and Image Processing
Current > Division/Research > Institute for Sustainable Industries and Liveable Cities
Keywords text mining; text data; online social data; stream processing; topic detection; document clustering; document classification; multi-window based ensemble learning; MWEL; learning algorithm; robust hierarchical ensemble; RHE
Download/View statistics View download statistics for this item

Search Google Scholar

Repository staff login