Web mining techniques for recommendation and personalization

Xu, Guandong (2008) Web mining techniques for recommendation and personalization. PhD thesis, Victoria University.

Web users today face information overload as both the amount of information on the Web and the number of users grow rapidly. As a result, providing Web users with precisely the information they need is becoming a critical issue in web-based information retrieval and Web applications. In this work, we aim to improve the performance of Web information retrieval and Web presentation by developing and employing Web data mining paradigms. Web data mining is a process that discovers the intrinsic relationships among Web data, which are expressed in the form of textual, linkage or usage information, by analysing the features of the Web and web-based data using data mining techniques. In particular, we concentrate on discovering Web usage patterns via Web usage mining, and then use the discovered usage knowledge to present Web users with more personalized Web content, i.e. Web recommendation. To analyse Web user behaviour, we first establish a mathematical framework, called the usage data analysis model, to characterise the co-occurrence observations recorded in Web log files. In this model, the relationships between Web users and pages are expressed by a matrix-based usage data schema. On the basis of this data model, we devise algorithms to discover the mutual associations between Web pages and user sessions hidden in the collected Web log data and, in turn, use this knowledge to uncover user access patterns. To reveal the underlying relationships among Web objects, such as Web pages or user sessions, and to find Web page categories and usage patterns in Web log files, we propose three latent semantic analysis techniques based on three statistical models: traditional Latent Semantic Indexing, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation.
In comparison with conventional Web usage mining approaches, the main strength of latent semantic analysis is that it not only captures the mutual correlations hidden in the observed objects explicitly, but also reveals the unseen latent factors or tasks associated with the discovered knowledge implicitly. In traditional Latent Semantic Indexing, a specific matrix operation, Singular Value Decomposition, is applied to the usage data to discover Web user behaviour patterns over a transformed latent Web page space, which approximates the original Web page space as closely as possible. A k-means clustering algorithm is then applied to the transformed usage data to partition the user sessions. Each discovered user session group is treated as an aggregation of user sessions in which all users share a similar access task or intention. The centroids of the discovered user session clusters are then constructed as user profiles. In addition to this intuitive latent semantic analysis, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation are introduced into Web usage mining for Web page grouping and usage profiling via probabilistic inference. Meanwhile, the latent task space is captured by interpreting the contents of the prominent Web pages that contribute significantly to user access preferences. In contrast to traditional latent semantic analysis, the latter two approaches are capable of not only revealing the underlying associations between Web pages and users, but also capturing the latent task space, which corresponds to user navigational patterns and Web site functionality. Experiments are performed to discover user access patterns, reveal the latent task space and evaluate the proposed techniques in terms of clustering quality.
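The Latent Semantic Indexing pipeline described above (usage matrix, Singular Value Decomposition, k-means, centroids as profiles) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the toy sessions, the row normalisation, the rank and cluster count, the farthest-point initialisation, and every function name are hypothetical choices made only for this example.

```python
import numpy as np

# Hypothetical toy data: each session maps a page to a visit weight
# (for example, a hit count). None of this comes from the thesis.
pages = ["home", "news", "sports", "shop", "cart"]
sessions = [
    {"news": 3, "sports": 2},
    {"home": 1, "news": 2, "sports": 3},
    {"shop": 4, "cart": 2},
    {"home": 1, "shop": 2, "cart": 3},
]

def build_usage_matrix(sessions, pages):
    """Session-by-page matrix with each row normalised to sum to 1."""
    column = {page: j for j, page in enumerate(pages)}
    matrix = np.zeros((len(sessions), len(pages)))
    for i, session in enumerate(sessions):
        for page, weight in session.items():
            matrix[i, column[page]] = weight
    row_sums = matrix.sum(axis=1, keepdims=True)
    return matrix / np.where(row_sums == 0, 1.0, row_sums)

def project_to_latent_space(matrix, rank):
    """Truncated SVD: session coordinates over the top `rank` latent factors."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] * s[:rank]

def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans(points, k, iterations=50):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(points[np.argmax(nearest)])
    centroids = np.array(centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(points, centroids)

usage = build_usage_matrix(sessions, pages)
latent = project_to_latent_space(usage, rank=2)
labels = kmeans(latent, k=2)

# User profiles: mean page-weight vector of the sessions in each cluster.
profiles = np.array([usage[labels == c].mean(axis=0) for c in range(2)])
print(labels)
print(profiles.round(2))
```

In this sketch the clustering runs in the latent space, while the profiles are averaged in the original page space so that each profile entry can be read directly as a page weight. Whether the thesis builds its centroids in the latent or the original space is not stated in the abstract, so that choice is an assumption.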
The discovered user profiles, which are represented by the centroids of the Web user session clusters, are then used to make usage-based collaborative recommendations via a top-N weighted scoring scheme. In this scheme, the user profiles are learned from usage data in an offline stage using the methods described above, and are treated as a usage pattern knowledge base. When a new active user session arrives, a matching operation finds the closest usage pattern (user profile) by measuring the similarity between the active user session and the learned user profiles. The user profile with the largest similarity is selected as the best-matching usage profile, which reflects the access interest most similar to that of the active user session. The pages in the best-matching usage profile are then ranked in descending order of their normalized page weights, which correspond to how likely the pages are to be visited in the near future. Finally, the top-N pages in the ranked list are recommended to the user as the pages most likely to be visited next. To evaluate the effectiveness and efficiency of the recommendation, experiments are conducted using the proposed recommendation accuracy metric. The experimental results demonstrate that the proposed latent semantic analysis models and related algorithms are able to efficiently extract the needed usage knowledge and to make accurate Web recommendations. Data mining techniques have recently been widely used in many other domains because of their powerful capability for non-linear learning from a wide range of data sources. In this study, we also extend the proposed methodologies and technologies to a biomechanical data mining application, namely gait pattern mining.
As in the context of Web mining, various clustering-based learning approaches are applied to the constructed gait variable data model, in which each subject is expressed as a feature vector of kinematic variables, to discover the subject gait classes. The centroids of the partitioned gait clusters are used to represent different specific walking characteristics. Data analysis on two gait datasets, corresponding to different specific populations, is carried out to demonstrate the feasibility and applicability of gait pattern mining. The results show that the discovered gait pattern knowledge can serve as a useful tool for human movement research and clinical applications.
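The matching and top-N ranking steps of the recommendation scheme described in this abstract could be sketched roughly as below. This is an illustrative sketch only: the abstract does not name the similarity measure, so cosine similarity is an assumption, as are the normalisation by the largest weight, the exclusion of pages already visited in the active session, the toy profiles, and all function names.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse page-weight dictionaries."""
    dot = sum(weight * b.get(page, 0.0) for page, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_top_n(active_session, profiles, n):
    """Pick the closest profile, then return its n highest-weighted unseen pages."""
    # Assumes every profile is non-empty and has at least one positive weight.
    best = max(profiles, key=lambda p: cosine_similarity(active_session, p))
    top_weight = max(best.values())
    # Normalise weights to [0, 1]; skipping visited pages is an assumed choice.
    scores = {
        page: weight / top_weight
        for page, weight in best.items()
        if page not in active_session
    }
    # Descending score, with the page name as a deterministic tie-breaker.
    ranked = sorted(scores, key=lambda page: (-scores[page], page))
    return ranked[:n]

# Hypothetical profiles, standing in for cluster centroids learned offline.
profiles = [
    {"home": 0.10, "news": 0.45, "sports": 0.30, "weather": 0.15},
    {"home": 0.10, "shop": 0.50, "cart": 0.25, "checkout": 0.15},
]
active_session = {"news": 1.0}

recommended = recommend_top_n(active_session, profiles, n=2)
print(recommended)  # ['sports', 'weather']
```

Keeping the profiles as an offline-built list and doing only the similarity match and sort at request time mirrors the offline/online split described in the abstract.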

