We live in the age of digital information. The number of internet users keeps growing and a tremendous volume of digital content is generated every day. It has become impossible for internet users to control and effectively manage the huge number of information items created and delivered. This problematic phenomenon is best known as "Information Overload."
To address this problem, researchers have been working to develop efficient filtering techniques capable of eliminating irrelevant information and presenting relevant information that catches the attention of users. Evidently, not all users have the same needs; this is why the filtering process should take the profiles of the users into consideration.
From basic to state-of-the-art architectures and techniques, this report attempts to present a comprehensive survey of filtering performance when the information is collected in real time.
Preface and Acknowledgments
This report is written by Joseph K. Chahine as a research project to obtain the Master of Research degree from the Lebanese University, Doctoral School of Science and Technology (EDST), in agreement with Université Paul Sabatier (UPS), Toulouse, France.
Dr. Mohand Boughanem, from UPS, supervises this project. I hereby take the opportunity to thank him for his valuable guidance and support.
I would also like to thank Netiks International, the company I work for, namely Mr. Joe Abi-Aad for his understanding and generous support.
Table of Contents
Table of Figures
Introduction
An Information Filtering system is a system that removes redundant or unwanted information from an information stream, using automated or computerized methods, before presenting the filtered information to a human user according to his or her interests. Its main goal is to manage information overload and to raise the semantic signal-to-noise ratio. To do this, the profile of the user is compared to some features which may originate from the information item (the content-based approach) or from the user’s social environment (the collaborative filtering approach) [01].
Information Filtering (IF) systems are similar to conventional Information Retrieval (IR) systems in that they assist in selecting documents that satisfy users’ information needs. However, certain fundamental differences do exist between IF and IR systems, making IF systems interesting and an independent object of analysis [02].
IR systems are usually designed to enable quick retrieval of information items for relatively short-term needs of a diverse group of users. In contrast, IF systems are commonly personalized to support long-term information needs of a particular user or a group of users with similar needs; they accomplish the goal of personalization by directly or indirectly acquiring information from the user.
Figure – Filtering versus Retrieval
In IF systems, these long-term information needs are represented as interest profiles, often called user profiles, which are subsequently used for filtering purposes. The interest profiles are maintained beyond a single session and are usually modified based on users’ feedback. Another important difference has to do with the information item source. IR systems usually operate on a relatively static set of documents, whereas IF systems are usually concerned with identifying relevant documents from a continuously changing document stream [03].
Recently, the world of the web has become more social and more real-time. Among a myriad of services, Facebook and Twitter are perhaps the exemplars of a new generation of social, real-time web services, providing a fertile ground for IF research [04].
Novel alternatives to conventional IF approaches are considered by harnessing the richness of Real-Time Web technologies, especially popular micro-blogging services, as a source of current and topical news.
Background
The Real-Time Web
The real-time web (RTW) consists of news, opinions, comments, and personal viewpoints, and it provides abbreviated and highly personalized information in real-time. The main popular systems that form, together and with others, today’s Real-Time Web are listed below.
Social Media
According to Kaplan and Haenlein [05], social media is defined as a group of internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content. The most popular social network is Facebook.
Micro-blogging Services
Micro-blogging is a broadcast medium in the form of blogging. A micro-blog differs from a traditional blog in that its content is typically smaller in both actual and aggregate file size. Micro-blogs allow users to exchange small elements of content such as short sentences, individual images, or video links [06].
Twitter is the best example of a micro-blogging service; it is a real-time, highly social micro-blogging service that allows you to post short messages of 140 characters or less; these messages are called tweets. Unlike social networks like Facebook and LinkedIn, where a connection is bidirectional, Twitter has an asymmetric network infrastructure of "friends" and "followers" [07].
Many people consider Twitter as a real-time news platform since it is a powerful publishing tool.
Real-Time RSS
RSS, or Really Simple Syndication, is a web publishing technology that allows end-users to automatically receive new digital content from the provider. Originally used for text files, RSS is now also used to deliver audio and video content. Various types of RSS readers are used to capture and play the content, and many of them are available free of charge [08]. Real-Time RSS does exactly what RSS does, but much faster [09].
Real-Time Financial Information
Financial market data requirements are rapidly expanding as customers integrate ever more sophisticated analytical and trading applications into their environments. Mission-critical decisions need accurate and timely currency exchange and other contributed data. Applications must accommodate these demands and allow for information to be published and accessed accordingly [10].
There are many financial communication protocols used in the markets for real-time financial data feeds. Examples are FIX (Financial Information eXchange) and MMTP (Market Message Transfer Protocol).
Real-Time Information Filtering
As explained earlier, IF is closely related to IR. There are some differences – one is that in IR, the information base is fairly static and the information need is dynamic, whereas the relation is reversed in filtering [11]; in other words, IF systems do not deal with data available as static collections of documents, but assume that documents are read from an incoming real-time data stream.
To operate efficiently, IF systems must acquire and maintain accurate knowledge regarding both documents and users. The dynamic nature of users’ interests and the information stream makes the maintenance of such knowledge quite complex. Acquiring correct user interest profiles is difficult, since users may be unsure of their interests and may not wish to invest a great deal of effort in creating such a profile. Acquiring information regarding documents is also difficult, because of the size of the information stream and the computational demands associated with parsing voluminous texts. At any time, new topics may be introduced in the stream, or user’s interests related to topics may change.
Furthermore, sufficiently representative documents may not be available to facilitate a priori analysis or training. Research on filtering, so far, has not clarified to a significant extent how these particular problems associated with users and documents may influence the overall filtering process [03]. Due to the lack of a unifying framework for IF, existing solutions were developed mostly in heuristic or ad-hoc ways [12].
Many constraints are to be considered when developing real-time IF agents:
The system must be relatively compact and should consume only limited resources. It should not, for example, require storing and re-processing previously accessed documents, and consequently must accumulate its contextual information along the way.
The system must run in real time. It must make its suggestions while the user is performing the task for which they are relevant.
The system should develop a user profile, reflecting the access patterns of the particular user, in order to provide personalized recommendations likely to be useful for that user.
The system should be able to use the user profile to produce a context profile when the user is accessing documents, reflecting both the user and the current task [13].
Techniques and Algorithms
There are several techniques and algorithms used in information filtering. The two most used techniques are the content-based filtering technique and the collaborative filtering technique.
In content-based filtering, the task is to select relevant information items from a continuous data stream pushed towards the user, based on knowledge of the user’s interests encoded into a user profile [14].
In collaborative filtering, the task is to predict preferences of an active user given a database of preferences of other users, where the preferences are typically expressed as numerical evaluation scores [15].
Regarding the algorithms, two types are used: the memory-based algorithms and the model-based algorithms.
In memory-based algorithms, the system works directly with users’ data [16]; it either queries the database to retrieve information items recommended for other, similar users (CF), or retrieves the items that are highly correlated with the stored user profile (CBF). Memory-based algorithms are particularly used in commercial systems such as BN and Amazon, because they are highly effective and easy to implement [17, 18].
In model-based algorithms, the system uses the user database to learn or estimate a model to make filtering decisions. The design and development of models (such as machine learning, data mining algorithms) can allow the system to learn to recognize complex patterns based on the training data, and then make intelligent predictions for the filtering tasks, based on the learned models. Model-based algorithms, such as Bayesian models, clustering models, and dependency networks, have been investigated to solve the shortcomings of memory-based algorithms [19, 20]. Usually, classification algorithms can be used as filtering models if the user ratings are categorical, and regression models and SVD methods can be used for numerical ratings [21].
Term Weighting Schemes
Term weighting schemes are used to represent documents as vectors in the term space. Such a vectorial structure can be processed efficiently without losing vital content; it has been widely tested and is general enough to support other computational requirements of the filtering environment [03].
One of the most used schemes is TF-IDF which is a well-known statistical method that assigns a weight to a term in proportion to the number of its occurrences in the document, and in reverse proportion to the documents in which it occurs at least once [22]. TF-IDF stands for Term Frequency-Inverse Document Frequency and can be used to query a corpus by calculating normalized scores that express the relative importance of terms in the documents [07]. Mathematically speaking, TF-IDF is expressed as the product of the normalized term frequency and the normalized inverse document frequency. The basic form of TF-IDF can be described by equation (1).
$$ w_{t,d} = \mathrm{tf}_{t,d} \times \log\left(\frac{N}{\mathrm{df}_{t}}\right) \qquad (1) $$
where $\mathrm{tf}_{t,d}$ represents the number of times the term $t$ appears in the document $d$, $\mathrm{df}_{t}$ is the number of documents in the corpus which contain the term $t$, and $N$ is the total number of documents in the corpus.
In order to avoid null-valued weights for non-occurring terms, and according to the probabilistic model adopted, other formulae can be used to smooth the term frequency and the inverse document frequency.
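As a concrete illustration, the following minimal Python sketch computes the basic, unsmoothed form of equation (1) over a toy corpus; the corpus, token lists, and function name are illustrative assumptions rather than part of any surveyed system.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus_tokens):
    """Basic TF-IDF of equation (1): term frequency times log(N / df)."""
    tf = Counter(doc_tokens)[term]                   # occurrences of the term in the document
    df = sum(1 for d in corpus_tokens if term in d)  # documents containing the term at least once
    n = len(corpus_tokens)                           # total number of documents in the corpus
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n / df)

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "real time information filtering".split(),
    "information retrieval and information filtering".split(),
    "real time financial data feeds".split(),
]
print(tf_idf("information", corpus[1], corpus))
print(tf_idf("financial", corpus[1], corpus))   # 0.0, the term does not occur in this document
```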
In real time, the number of documents is not static; it keeps increasing as time goes by, which is why TF-IDF does not seem useful for this research.
Another popular scheme is BM25, also called the Okapi model, which is a probabilistic function for term weighting. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters [23].
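The sketch below shows one commonly used parameterization of the Okapi BM25 scoring function; the tuning constants k1 = 1.2 and b = 0.75 are conventional default values, not values prescribed by this report, and the smoothed IDF is one of several variants in the BM25 family.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus_tokens, k1=1.2, b=0.75):
    """One common BM25 variant: sum over query terms of IDF(t) times a saturated term frequency."""
    n = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n           # average document length
    freqs = Counter(doc_tokens)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus_tokens if t in d)         # document frequency of the term
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)    # smoothed IDF, always non-negative
        tf = freqs[t]
        norm = tf + k1 * (1.0 - b + b * len(doc_tokens) / avgdl)
        score += idf * (tf * (k1 + 1.0)) / norm
    return score
```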
Similarity Computation
In IF, the main goal of similarity computation techniques is to measure the distance between users and/or information items. The two most used similarity distances are the Pearson Correlation and the Cosine-Based Similarity.
Pearson Correlation
A commonly used similarity is the Pearson correlation, which measures the extent to which two variables linearly relate with each other [24]. This is one of the first similarity computation techniques proposed and also one of the most popular [25]. In IF, the Pearson correlation is usually used to compute the similarity between two users over the ratings of their common items, as in the formula:
$$ \mathrm{sim}(u,v) = \frac{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_{u})(r_{v,i} - \bar{r}_{v})}{\sqrt{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_{u})^{2}} \, \sqrt{\sum_{i \in I_{uv}} (r_{v,i} - \bar{r}_{v})^{2}}} \qquad (2) $$
where $I_{uv}$ is the set of items rated by both users $u$ and $v$, $r_{u,i}$ is the rating given to item $i$ by user $u$, and $\bar{r}_{u}$ is the mean rating given by user $u$.
Another variant of the Pearson correlation is the Weighted Pearson Correlation. This measure is based on the idea of capturing the confidence that can be placed on a neighbor. The problem is that, if two users have few items in common, any of the aforementioned measures can consider them similar simply because, by chance, the ratings of these few items coincide. As the number of items in common increases, the similarity is more likely to be due to the fact that the users are actually quite alike and not to pure coincidence. So the confidence in the similarity measure increases with the number of items in common. The Pearson correlation coefficient is weighted by the number of items in common, as in the formula:
$$ \mathrm{sim}_{w}(u,v) = \mathrm{sim}(u,v) \times \frac{\min(|I_{uv}|, H)}{H} \qquad (3) $$
where $\mathrm{sim}(u,v)$ designates the normal Pearson correlation coefficient of equation (2), $|I_{uv}|$ is the number of items rated in common, and $H$ is a threshold determined experimentally, beyond which the correlation measure can be trusted [25].
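The following Python sketch implements equations (2) and (3) on rating dictionaries; the sample ratings and the threshold value H = 50 are illustrative assumptions.

```python
import math

def pearson(ratings_u, ratings_v):
    """Pearson correlation of equation (2), computed over the items co-rated by both users."""
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den = math.sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common)) * \
          math.sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    return num / den if den else 0.0

def weighted_pearson(ratings_u, ratings_v, threshold=50):
    """Significance-weighted Pearson of equation (3): shrink the correlation when the
    number of co-rated items is below the (experimentally chosen) threshold H."""
    n_common = len(set(ratings_u) & set(ratings_v))
    sim = pearson(ratings_u, ratings_v)
    return sim * min(n_common, threshold) / threshold

# Hypothetical rating dictionaries: item id -> rating.
u = {"i1": 5, "i2": 3, "i3": 4}
v = {"i1": 4, "i2": 2, "i3": 5, "i4": 1}
print(pearson(u, v), weighted_pearson(u, v))
```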
Cosine-Based Similarity
As an alternative to the Pearson correlation, one can treat the two users as vectors in a finite-dimensional space and compute the similarity based on the cosine of the angle between them, given by:
$$ \mathrm{sim}(u,v) = \cos(\vec{u},\vec{v}) = \frac{\vec{u}\cdot\vec{v}}{\|\vec{u}\| \, \|\vec{v}\|} = \frac{\sum_{i} r_{u,i}\, r_{v,i}}{\sqrt{\sum_{i} r_{u,i}^{2}} \, \sqrt{\sum_{i} r_{v,i}^{2}}} \qquad (4) $$
In IF, the compared vectors usually contain the item ratings of the two users $u$ and $v$.
Adjusted Cosine Measure is a variant of the Vector Cosine-Based Similarity, and it obtains far better results than the latter in IF, specifically in user-based CF.
$$ \mathrm{sim}(u,v) = \frac{\sum_{i \in I_{uv}} (r_{u,i} - \bar{R}_{u})(r_{v,i} - \bar{R}_{v})}{\sqrt{\sum_{i \in I_{uv}} (r_{u,i} - \bar{R}_{u})^{2}} \, \sqrt{\sum_{i \in I_{uv}} (r_{v,i} - \bar{R}_{v})^{2}}} \qquad (5) $$
where $\bar{R}_{u}$ and $\bar{R}_{v}$ are the means of all ratings of users $u$ and $v$ respectively.
Note that the main difference between the Pearson Correlation and the Adjusted Cosine Measure is that the means used in (2) are the means of the co-rated items, while the means used in (5) are the means of all items rated by each user.
Weighted Cosine Similarity is also another variant of the Vector Cosine-Based Similarity, and it is used when the dimensions of the finite space are weighted.
$$ \mathrm{sim}_{w}(\vec{u},\vec{v}) = \frac{\sum_{i} w_{i}\, u_{i}\, v_{i}}{\sqrt{\sum_{i} w_{i}\, u_{i}^{2}} \, \sqrt{\sum_{i} w_{i}\, v_{i}^{2}}} \qquad (6) $$
where $w_{i}$ is the weight assigned to dimension $i$ in the space.
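A minimal sketch of equations (4), (5), and (6) on the same kind of rating dictionaries; treating missing ratings as zeros in the plain and weighted cosine is one common convention, assumed here for simplicity.

```python
import math

def cosine(u, v):
    """Vector cosine similarity of equation (4); missing ratings are treated as 0."""
    items = set(u) | set(v)
    dot = sum(u.get(i, 0) * v.get(i, 0) for i in items)
    nu = math.sqrt(sum(r * r for r in u.values()))
    nv = math.sqrt(sum(r * r for r in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def adjusted_cosine(u, v):
    """Adjusted cosine of equation (5): centre each user's ratings on the mean of ALL their ratings."""
    mean_u = sum(u.values()) / len(u)
    mean_v = sum(v.values()) / len(v)
    common = set(u) & set(v)
    num = sum((u[i] - mean_u) * (v[i] - mean_v) for i in common)
    den = math.sqrt(sum((u[i] - mean_u) ** 2 for i in common)) * \
          math.sqrt(sum((v[i] - mean_v) ** 2 for i in common))
    return num / den if den else 0.0

def weighted_cosine(u, v, w):
    """Weighted cosine of equation (6): w assigns a weight to every dimension of the space."""
    items = set(u) | set(v)
    dot = sum(w.get(i, 1) * u.get(i, 0) * v.get(i, 0) for i in items)
    nu = math.sqrt(sum(w.get(i, 1) * u.get(i, 0) ** 2 for i in items))
    nv = math.sqrt(sum(w.get(i, 1) * v.get(i, 0) ** 2 for i in items))
    return dot / (nu * nv) if nu and nv else 0.0
```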
Content-Based Filtering
The first approximations to information filtering were based on content. Such systems select which items to recommend based on their content. Therefore, the user profile is a representation of the content in which the user is interested. This kind of filtering is especially effective when retrieving text documents, where each document is represented by a set of keywords [25].
One of the limitations when using CBF is that no new topics are explored; only those that are similar to topics already in the user profile. This leads to over-specialization: one is restricted to seeing items similar to those that have already been rated highly. This has been addressed in some cases with the injection of randomness. Content-based techniques, moreover, are difficult to apply to situations where the desirability of an item, for example a web page, is determined in part by multimedia content or aesthetic qualities. These types of materials are generally incompatible with the type of content analysis that these techniques require in order to make further recommendations [26].
In addition, recommender systems based on CBF frequently require explicit feedback to assess the relevance of their suggestions. Users often avoid rating the items of interest to generate feedback to the system, and the fewer the ratings, the more limited the filtering is [26].
In case-based reasoning, the technique assumes that if a user likes a certain item, he or she will probably also like similar items. The algorithm will recommend new but similar items. On the other hand, the attribute-based technique recommends items based on the matching of their attributes with the user profile. Attributes could be weighted based on their importance to the user [27].
The following table, adapted from [27], shows the main advantages and disadvantages of the case-based reasoning against the attribute-based technique.
Technique: Case-Based Reasoning
Advantages: No content analysis; Domain-independent; Improved quality
Disadvantages: New user problem; Over-specialization; Sparsity; Cold-start problem

Technique: Attribute-Based
Advantages: No cold-start problem; No new user/item problem; Sensitive to changes of preferences; Can include non-item related features; Can map from user needs to items
Disadvantages: Does not learn; Only works with categories; Ontology modeling and maintenance is required; Over-specialization
Neighborhood-Based Algorithms
Neighborhood-based algorithms are frequently used to compute the similarity between users or information items and use these similarities for prediction purposes. Usually, the choice of the similarity measure used for evaluation of neighborhood relationships is crucial for the success of such approaches [28].
Such algorithms, also called nearest-neighbor algorithms, are used in both CBF and CF. In CBF, they are used in case-based reasoning to find the nearest items. In CF, they are used to also find the nearest users.
Neighborhood-based algorithms are effective when it is difficult to provide a general model that can be adapted to represent different concepts of similarity, especially when the number of available cases may be too small to estimate the optimal set of parameters for such a general model. In such cases, it is almost impossible to produce a high-level generalization of a "class" of objects [29].
Unfortunately, neighborhood-based algorithms require computations that grow linearly with the number of users and items [30].
Top-N Recommendations Algorithms
The main goal of top-N recommendation systems is to identify a set of N information items that are the best candidates to interest the user [30].
User-Based Top-N Recommendation Algorithms
This approach, also known as user-based collaborative filtering, is considered the most successful approach for recommendation [55]. It filters information items based on similar users’ feedback about the items. Similarity between users is usually calculated using the Pearson Correlation or the Cosine Similarity [56]. In general, the top-N recommended items for a certain user are computed as follows (a minimal sketch follows the steps below):
Identify the most similar users to the current user.
Aggregate the weights of all information items rated, whether directly or indirectly, by these users.
Sort the aggregated items by weight and recommend the top items that were not previously recommended to the current user.
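The three steps above can be sketched as follows; the similarity-weighted aggregation of neighbour ratings is one common choice among several, and the data layout (nested dictionaries of ratings) is an assumption made for illustration. The similarity argument can be any of the measures discussed earlier, for instance the Pearson function sketched above.

```python
def user_based_top_n(target, ratings, similarity, k=20, n=10):
    """Sketch of user-based top-N recommendation.
    ratings: dict user -> dict item -> rating; similarity: function of two rating dicts."""
    # 1. Identify the k users most similar to the target user.
    neighbours = sorted(
        ((similarity(ratings[target], ratings[u]), u) for u in ratings if u != target),
        reverse=True)[:k]
    # 2. Aggregate, for every item rated by these neighbours, a similarity-weighted score.
    scores = {}
    for sim, u in neighbours:
        if sim <= 0:
            continue
        for item, rating in ratings[u].items():
            scores[item] = scores.get(item, 0.0) + sim * rating
    # 3. Drop items the target user already knows and keep the n highest-scoring ones.
    seen = set(ratings[target])
    ranked = sorted(((s, i) for i, s in scores.items() if i not in seen), reverse=True)
    return [item for _, item in ranked[:n]]
```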
Despite the success of user-based top-N recommendation algorithms, they have some limitations related to real-time performance and scalability. It should be noted that the computational cost of such algorithms grows in a linear fashion with the number of users, which in a practical application can grow to millions [55].
Item-Based Top-N Recommendation Algorithms
This approach, also known as model-based top-N recommendation, examines historical information to detect relations between different items, then uses these relations to decide which items should be recommended. The primary motivation behind these algorithms is the fact that a user is more likely to be interested in an information item that is similar to the previous items that interested him or her.
Such algorithms produce recommendations very quickly but are likely to require a significant amount of time to build the models. Moreover, their recommendations are generally of lower quality than those produced by a user-based system [55].
Bayesian Belief Net Algorithms
Bayesian Belief Net (BN) algorithms are model-based algorithms used in CBF and in CF as well. A Bayesian network is a graph in which the following holds:
A set of random variables makes up the nodes of the network.
A set of directed links connects pairs of nodes. The intuitive meaning of an arrow from node X to node Y is that X has a direct influence on Y.
Each node has a conditional probability table (CPT) that quantifies the effects that the parents have on the node. The parents of a node X are all those nodes that have arrows pointing to X.
The graph has no directed cycles, hence is a directed acyclic graph, or DAG [31].
BNs are often used for classification tasks. Motivated by the simplicity and accuracy of the Naïve Bayes (NB) classifier, BNs are increasingly used for pattern recognition, fault diagnosis and other classification tasks [24].
Simple Bayesian Algorithm
The probability of a message $m$ being in class $c$ is computed as:
$$ P(c \mid m) \propto P(c) \prod_{1 \le k \le n} P(t_{k} \mid c) \qquad (3) $$
where $P(t_{k} \mid c)$ is the conditional probability of term $t_{k}$ occurring in a message of class $c$, $P(c)$ is the prior probability of a message occurring in class $c$, and $n$ is the number of distinct terms in the message.
$P(t_{k} \mid c)$ can be used to measure how much evidence $t_{k}$ contributes that $c$ is the correct class [32]. The class of a message is determined by finding the most likely or maximum a posteriori (MAP) class, defined by:
$$ c_{\mathrm{MAP}} = \underset{c \in C}{\arg\max} \; P(c) \prod_{1 \le k \le n} P(t_{k} \mid c) \qquad (4) $$
where $C$ designates a finite set of classes.
Since equation (4) involves a multiplication of many conditional probabilities, one for each feature, the computation can result in a floating point underflow. In practice, the multiplication of probabilities is often converted to an addition of logarithms of probabilities and, therefore, the maximization of the equation is alternatively performed by:
$$ c_{\mathrm{MAP}} = \underset{c \in C}{\arg\max} \left[ \log P(c) + \sum_{1 \le k \le n} \log P(t_{k} \mid c) \right] \qquad (5) $$
All model parameters, i.e. class priors and term probability distributions, can be estimated with relative frequencies from a training set. Note that when a class and a message term do not occur together in the training set, the corresponding frequency-based probability estimate will be zero, which would make the right-hand side of equation (5) undefined. This problem can be mitigated by incorporating some correction, such as Laplace smoothing, in all probability estimates.
NB is a simple probability learning model and can be implemented very efficiently with a linear complexity. It applies a simplistic or naïve assumption that the presence or absence of a term in a class is completely independent of any other terms. Despite the fact that this oversimplified assumption is often inaccurate (in particular for text domain problems), NB is one of the most widely used classifiers and possesses several properties that make it surprisingly useful and accurate [33].
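A compact sketch of a multinomial naïve Bayes filter following equations (3) to (5), with Laplace smoothing and log-space computation to avoid underflow; the toy training messages and class labels are purely illustrative.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesFilter:
    """Multinomial naive Bayes with Laplace smoothing, following equations (3)-(5)."""

    def fit(self, messages, labels):
        self.classes = set(labels)
        self.priors = Counter(labels)             # class frequencies -> P(c)
        self.term_counts = defaultdict(Counter)   # term frequencies per class -> P(t_k | c)
        self.vocab = set()
        for tokens, c in zip(messages, labels):
            self.term_counts[c].update(tokens)
            self.vocab.update(tokens)
        self.total = len(labels)

    def predict(self, tokens):
        best_class, best_score = None, -math.inf
        for c in self.classes:
            # log P(c) + sum_k log P(t_k | c), as in equation (5)
            score = math.log(self.priors[c] / self.total)
            denom = sum(self.term_counts[c].values()) + len(self.vocab)
            for t in tokens:
                score += math.log((self.term_counts[c][t] + 1) / denom)   # Laplace smoothing
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Hypothetical toy training set.
nb = NaiveBayesFilter()
nb.fit([["stocks", "rise"], ["goal", "match"], ["market", "stocks"]],
       ["finance", "sport", "finance"])
print(nb.predict(["stocks", "market", "rally"]))   # expected: "finance"
```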
Because most real-world data are multiclass ones, the simple Bayesian algorithm has been applied to multiclass data for filtering tasks, and was found to have worse predictive accuracy but better scalability than the Pearson correlation-based filtering as it makes predictions based on observed ratings, and the prediction making process is less time-consuming. Although it is considered as a model-based algorithm, the simple Bayesian algorithm can be also regarded as memory-based algorithm because of its in-memory calculation for predictions [21].
Figure – Simple Bayes
NB-ELR and TAN-ELR Algorithms
Because of the limitations of the simple Bayesian algorithm for filtering tasks, advanced BN filtering algorithms, with their ability to deal with incomplete data, can be used instead [24]. In the naïve Bayesian structure, in fact, every attribute (every leaf in the network) is independent of the rest of the attributes, given the state of the class variable (the root of the network) [34].
Tree Augmented Naive Bayes (TAN) is a classification algorithm that allows additional edges between attributes in order to capture correlations among them. It approximates the interactions between attributes by using a tree structure imposed on the naïve Bayesian structure. Below is a sketch of a TAN model learned from a medical dataset. The dashed lines are those required by the naïve Bayesian classifier; the solid lines are correlation edges between attributes [34].
Figure – Tree Augmented Naive Bayes model
Extended Logistic Regression (ELR) is a gradient-ascent algorithm [35, 36], which is a discriminative parameter-learning algorithm that maximizes conditional log-likelihood.
TAN-ELR (TAN optimized by ELR) and NB-ELR (Naïve Bayes optimized by ELR) have been proven to have high classification accuracy for both complete and incomplete data [35, 36].
Applied to CF tasks, working on real-world multiclass CF datasets and using the mean absolute error as evaluation criterion, the empirical results show that the TAN-ELR CF and NB-ELR CF algorithms perform significantly better than the simple Bayesian CF algorithm, and consistently better than the Pearson correlation memory-based CF algorithm [24].
However, TAN-ELR and NB-ELR need a longer time to train the models. A solution is to run the time-consuming training stage offline, and the online prediction-producing stage will take a much shorter time [21].
Clustering Algorithms
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in the other clusters. A cluster of data objects can be treated collectively as one group in many applications [37].
The measurement of the similarity between objects is determined using metrics such as the Minkowski distance and the Pearson correlation. For two data objects $i$ and $j$, the popular Minkowski distance is defined as:
$$ d(i,j) = \left( \sum_{k=1}^{n} |x_{ik} - x_{jk}|^{q} \right)^{1/q} \qquad (4) $$
where $n$ is the number of dimensions of the objects, $x_{ik}$ and $x_{jk}$ are the values of the $k$th dimension of objects $i$ and $j$ respectively, and $q$ is a positive integer. When $q = 1$, $d$ is called the Manhattan distance; when $q = 2$, $d$ is called the Euclidean distance [21].
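The Minkowski distance above translates directly into code; the example points are illustrative.

```python
def minkowski(x, y, q=2):
    """Minkowski distance of equation (4): q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

print(minkowski([0, 0], [3, 4], q=2))   # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], q=1))   # 7.0 (Manhattan)
```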
In general, major clustering methods can be classified in five categories as follows:
Partitioning Methods
Given a collection of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.
Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique which attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different.
Instead of applying exhaustive enumeration of all possible partitions, most applications adopt one of two popular heuristic methods: (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster [37].
Figure – Partition Clustering
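A minimal k-means sketch of the iterative relocation idea described above; the random initialization and the toy two-dimensional points are illustrative assumptions.

```python
import random

def k_means(points, k, iterations=100, seed=0):
    """Minimal k-means: each cluster is represented by the mean of its members."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)                       # initial centres picked at random
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # assign each point to its nearest centre
            idx = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[idx].append(p)
        new_centres = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centres[i]
            for i, cluster in enumerate(clusters)]
        if new_centres == centres:                        # converged: relocation changes nothing
            break
        centres = new_centres
    return centres, clusters

points = [(1, 1), (1.5, 2), (8, 8), (8.5, 9), (0.5, 1.5), (9, 8.5)]
centres, clusters = k_means(points, k=2)
print(centres)
```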
Hierarchical Methods
Hierarchical clustering is one of the most frequently used methods in unsupervised learning. Given a set of data points, hierarchical clustering outputs a binary tree (dendrogram) whose leaves are the data points and whose internal nodes represent nested clusters of various sizes. The tree organizes these clusters hierarchically, where the hope is that this hierarchy agrees with the intuitive organization of real-world data. Hierarchical structures are ubiquitous in the natural world. For example, the evolutionary tree of living organisms (and consequently features of these organisms such as the sequences of homologous genes) is a natural hierarchy. Hierarchical structures are also a natural representation for data which was not generated by evolutionary processes. For example, internet newsgroups, emails, or documents from a newswire, can be organized in increasingly broad topic domains [38].
Based on how the hierarchical decomposition is formed, a hierarchical method can be categorized as being either divisive or agglomerative. The divisive approach, also called the "top-down" approach, starts with all the objects in the same cluster; in each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster, or until a termination condition holds. The agglomerative approach, also called the "bottom-up" approach, starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds [37].
There are several limitations to hierarchical clustering. The algorithm provides no guide to choosing the "correct" number of clusters or the level at which to prune the tree. It is often difficult to know which distance metric to choose, especially for structured data such as images or sequences. The traditional algorithm does not define a probabilistic model of the data, so it is hard to ask how "good" a clustering is, to compare it to other models, to make predictions, or to cluster new data into an existing hierarchy [38]. An additional limitation is the fact that once a step is done, whether a merge or a split, it can never be undone. This rigidity of the hierarchical method is both the key to its success, because it leads to smaller computation costs without worrying about a combinatorial number of different choices, and the key to its main problem, because it cannot correct erroneous decisions [37].
Figure – Hierarchical Clustering
Density-based methods
Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes. Clustering methods have therefore been developed based on the notion of density. The general idea is to continue growing a given cluster as long as the density (the number of objects or data points) in the "neighborhood" exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
Figure – Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm. It finds a number of clusters starting from the estimated density distribution of the corresponding nodes. It is one of the most common clustering algorithms and also one of the most cited in the scientific literature [39].
OPTICS (Ordering Points To Identify the Clustering Structure) is another algorithm for finding density-based clusters in spatial data. Its basic idea is similar to DBSCAN, but it addresses one of DBSCAN’s major weaknesses: the problem of detecting meaningful clusters in data of varying density. In order to do so, the points of the database are (linearly) ordered such that points which are spatially closest become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that needs to be accepted for a cluster in order to have both points belong to the same cluster. This can be represented as a dendrogram [40].
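A compact, unoptimized sketch of the DBSCAN idea (core points, border points, and noise); the eps and min_pts values, the Euclidean distance, and the toy points are illustrative assumptions, and a production implementation would use a spatial index instead of the quadratic neighbourhood search used here.

```python
def dbscan(points, eps, min_pts, dist=lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5):
    """Compact DBSCAN sketch: label -1 marks noise, labels 0..k-1 mark the discovered clusters."""
    labels = [None] * len(points)
    cluster = -1
    neighbours = lambda i: [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:          # not a core point: provisionally noise
            labels[i] = -1
            continue
        cluster += 1                      # start a new cluster and expand it
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:           # noise reachable from a core point becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)
    return labels

points = [(1, 1), (1.2, 1.1), (0.9, 1.3), (8, 8), (8.1, 8.2), (25, 25)]
print(dbscan(points, eps=1.0, min_pts=2))   # two clusters and one noise point
```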
Grid-based methods
A grid-based clustering method divides the data point space into a finite number of cells forming a grid structure, also called the quantized space. It then performs clustering operations on the grid structure. The main advantage of this method is its fast processing time, which depends only on the number of cells in the new space and does not usually depend on the number of data points to be clustered [37]. Since the grid can be multi-dimensional, this method suffers from a dimensionality problem: as more dimensions are added, the number of cells grows exponentially [57].
Grid-based clustering algorithms can be optimized to run in real-time by determining the number of the clusters according to the point density of the grid cell and by computing the distances between the centers of the clusters and the grid cells, not the data points [58].
Model-based methods
Model-based clustering is a statistical approach to data clustering. The observed (multivariate) data is assumed to have been generated from a finite mixture of component models. Each component model is a probability distribution, typically a parametric multivariate distribution. Such methods have found applications in several domains such as text clustering, image processing, computational biology, and climate sciences [59].
Regression-Based Algorithms
…
Latent Semantic Models
Latent semantic models are well-known techniques for IR and IF. The approach is to take advantage of implicit higher-order (or latent) structure in the association of terms with documents (semantic structure) in order to improve the detection of relevant documents on the basis of terms found in queries [41].
Latent semantic models are used to solve the problems that arose when using the vector space model, especially the synonymy and polysemy challenges. Synonymy occurs when there are many ways to refer to the same object, for example, car and automobile. Synonymy is known to lead to poor recall. On the other hand, polysemy occurs when a word has more than one distinct meaning, for example, model. Polysemy is known to lead to poor precision.
Latent Semantic Indexing
Latent Semantic Indexing (LSI) [41] is a dimensionality reduction technique developed in IR in order to address the problems deriving from the use of synonymous, near-synonymous, and polysemous words as dimensions of document representations. This technique compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence. In practice, LSI infers the dependence among the original terms from a corpus and "wires" this dependence into the newly obtained, independent dimensions. The function mapping original vectors into new vectors is obtained by applying a Singular Value Decomposition (SVD) to the matrix formed by the original document vectors [42].
SVD is a mathematical technique used to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts [43].
One characteristic of LSI is that the newly obtained dimensions are not intuitively interpretable. However, they work well in bringing out the "latent" semantic structure of the vocabulary used in the corpus. [42]
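A small numpy sketch of the LSI idea: apply a truncated SVD to a term-document matrix and compare documents in the reduced latent space. The toy matrix, the vocabulary, and the choice of k = 2 latent dimensions are illustrative assumptions.

```python
import numpy as np

# Hypothetical term-document matrix: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "model", "topic"]
A = np.array([[2, 0, 1, 0],
              [0, 2, 1, 0],
              [1, 1, 2, 0],
              [0, 0, 0, 2],
              [0, 0, 1, 1]], dtype=float)

# Truncated SVD: keep only the k largest singular values (the "latent" dimensions).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T        # documents in the reduced k-dimensional space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The first two documents (one about "car", one about "automobile") are expected
# to end up close in the latent space even though they share few literal terms.
print(cos(doc_vectors[0], doc_vectors[1]))
```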
Probabilistic Latent Semantic Indexing
To substantiate the claims regarding LSI, and to study its relative strengths and weaknesses, it is useful to develop a generative probabilistic model of text corpora and to study the ability of LSI to recover aspects of the generative model from data [44]. Given a generative model of text, however, it is not clear why one should adopt the LSI methodology; one can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods.
A significant step forward in this regard was made by Hofmann [45], who presented the probabilistic LSI (pLSI) model, also known as the aspect model, as an alternative to LSI. The pLSI approach models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of topics. Thus each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a list of mixing proportions for these mixture components and thereby reduced to a probability distribution on a fixed set of topics. This distribution is the reduced description associated with the document [46].
Latent Dirichlet Allocation
While Hofmann’s work is a useful step toward probabilistic modeling of text, it is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. This leads to several problems: (1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting, and (2) it is not clear how to assign probability to a document outside of the training set [46]. To solve the incompleteness issue, Latent Dirichlet Allocation (LDA) is used because it follows a full generation process for document collection [46].
Latent Dirichlet Allocation (LDA) is a generative probabilistic model for analyzing collections of documents. It can also be applied to other types of data, although the terminology of natural language processing is often used in an intuitive way to describe the method. LDA is based on a Bayesian model where each document is modeled as a mixture of topics, and each topic is a discrete probability distribution (in other words, an array) that defines how likely each word is to appear in a given topic. These topic probabilities give a concise representation of a document. Here, a document is a "bag of words" with no structure beyond the topic and word statistics.
Algorithms such as LDA have made great strides in modeling the semantic relationships between words based on their co-occurrences in documents. The learning algorithm assumes that each document is basically a mixture of latent topics. Each topic is a probability distribution over words; in other words, each topic is an array with one slot for each word in the lexicon, where the elements sum to 1. We imagine that each word in a document was generated by repeatedly selecting a topic at random from the topic mixture and then selecting a word at random from the topic. Given these parameters, the topics of all words in the same document are assumed to be independent and thus the "bag of words" description [47].
…
Each document is represented as a collection of words. The words are considered exchangeable and conditionally independent with respect to their underlying distribution.
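The generative process just described can be sketched directly; the vocabulary, hyperparameters, and corpus sizes below are illustrative assumptions, and in practice the process is run in reverse, i.e., the topic distributions are inferred from observed documents rather than sampled.

```python
import numpy as np

# A toy run of the LDA generative process (all symbols here are illustrative assumptions).
rng = np.random.default_rng(0)
vocab = ["market", "stocks", "trade", "goal", "match", "team"]
n_topics, n_docs, doc_len = 2, 3, 8
alpha, beta = 0.5, 0.5                                 # Dirichlet hyperparameters

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet([beta] * len(vocab), size=n_topics)

documents = []
for _ in range(n_docs):
    theta = rng.dirichlet([alpha] * n_topics)          # per-document mixture of topics
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)              # pick a topic from the mixture
        w = rng.choice(len(vocab), p=topic_word[z])    # pick a word from that topic
        words.append(vocab[w])
    documents.append(words)

for doc in documents:
    print(" ".join(doc))
```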
Online Latent Dirichlet Allocation
…
Multinomial Model (mixture model)
…
MDP-Based Algorithms
User Profiling
On the web today, sites are still unable to market to each individual user in a way which truly matches his or her interests and needs. Sites make broad generalizations based on huge volumes of data that do not accurately describe an individual person. Since people display regularities in almost everything they do, it would be far more useful if web sites were able to understand specific users’ interests and provide them with the content that is relevant to them. Instead of requiring users to provide their interests, a website could learn the type of content that interests its users [48].
User profiles for information filtering systems can be generated manually by users or automatically by a system. Previous research has shown that users fail to define their information needs accurately [49].
Common Steps
Creating the Initial Profile
Traditional personalized search approaches rely solely on individual profiles to construct a user model. They are often confronted by two major problems: data sparseness and cold-start for new individuals. Data sparseness refers to the fact that most users only visit a small portion of Web pages and hence a very sparse user-term relationship matrix is generated, while cold-start for new individuals means that the system cannot conduct any personalization without previous browsing history. Recently, community-based approaches were proposed to use the group’s social behaviors as a supplement to personalization. However, these approaches only consider the commonality of a group of users and still cannot satisfy the diverse information needs of different users [50].
Keeping the Profile Up-To-Date
As the preferences of users change with time, their profiles will need to be updated. The most direct way to know the users’ interests in an information item is to require them to explicitly classify it with an interest value on a certain scale. However, eliciting an explicit classification for all information items a user reads is both intrusive and time-consuming. Classification tasks tend to cause undue cognitive loads and users tend to refrain from doing them. Furthermore, an explicit classification scheme would require awareness from the users that their interests are changing and there are no guarantees that the classification criteria will remain the same for all interactions with the system. Relying solely on the users’ actions to keep their profiles up-to-date is not a good solution [51].
Main Models
User-Created Profiles
Based on explicit user feedback [52], this is the simplest and most natural approach. The user specifies his or her areas of interest by a list of keywords or terms. The specified terms are then used to guide the filtering process [22]. This approach where the user explicitly defines his interests is mostly used in content-based filtering (CBF).
For instance, in Web-based social networks such as MySpace and YouTube, the user has to enter the profile by herself/himself. Unfortunately, the information obtained solely from the user entering a profile is sometimes incomplete or inconsistent. Users do not fill in some information merely because they are not willing to fill it in [53].
System-Created Profiles
Based on implicit user feedback [54] or Hidden Markov Models
…
State-of-the-Art Experiments
Scalable Clustering of News Search Results
…
Mutually Beneficial Learning with Application to On-line News Classification
…
Tag-based Object Similarity Computation Using Term Space Dimension Reduction
…
Real-Time Automatic Tag-Based Recommendation
Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections
Density-Based Clustering for Real-Time Stream Data
Using Twitter to Recommend Real-Time Topical News
Future Work
Hybrid System: CBF + CF
CBF uses OLDA and Public Twitter Feed (perhaps retrieved via keywords related to topics)
Collect a large dataset for each classification task (business domain) to train LDA
CF uses Quick SVD (item-based or user-based)
The recommendation is based on a weighted sum of (a) and (b). If the user generates feedback, the actual feedback rating is returned to the input of an Artificial Neural Network (ANN), which adjusts the weights of (a) and (b)
Abbreviations and Acronyms
Below is a list of abbreviations and acronyms appearing in this report. The list is ordered alphabetically.
BN
Belief Net
CF
Collaborative Filtering
CBF
Content-Based Filtering
CPT
Conditional Probability Table
DAG
Directed Acyclic Graph
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
ELR
Extended Logistic Regression
FIX
Financial Information eXchange
IDF
Inverse Document Frequency
IF
Information Filtering
IR
Information Retrieval
LDA
Latent Dirichlet Allocation
LSI
Latent Semantic Indexing
MAP
Maximum a Posteriori
MMTP
Market Message Transfer Protocol
NB
Naïve Bayes
OLDA
Online Latent Dirichlet Allocation
OPTICS
Ordering Points To Identify the Clustering Structure
pLSI
Probabilistic Latent Semantic Indexing
RSS
Really Simple Syndication
SVD
Singular Value Decomposition
TAN
Tree Augmented Naïve Bayes
TF
Term Frequency
UPS
Université Paul Sabatier