Lexical similarity between two sentences is measured as the overlap between their vocabularies. A Jaccard similarity score is computed over the word overlap, where the overlap is measured separately over word tokens, word stems and word lemmas, and the final lexical similarity is taken as the average of these three values. If the lexical similarity exceeds a threshold of 0.7, the sentences are considered lexically similar and the redundant sentence is removed.
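As a rough sketch of this step (assuming NLTK's tokenizer, Porter stemmer and WordNet lemmatizer, with the relevant NLTK data downloaded), the Jaccard overlap can be computed over tokens, stems and lemmas and averaged:

```python
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def jaccard(a, b):
    # Jaccard similarity between two collections of words.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def lexical_similarity(sent1, sent2, threshold=0.7):
    tokens1 = [t.lower() for t in word_tokenize(sent1)]
    tokens2 = [t.lower() for t in word_tokenize(sent2)]
    stems1 = [stemmer.stem(t) for t in tokens1]
    stems2 = [stemmer.stem(t) for t in tokens2]
    lemmas1 = [lemmatizer.lemmatize(t) for t in tokens1]
    lemmas2 = [lemmatizer.lemmatize(t) for t in tokens2]
    # Average the Jaccard scores over tokens, stems and lemmas.
    score = (jaccard(tokens1, tokens2)
             + jaccard(stems1, stems2)
             + jaccard(lemmas1, lemmas2)) / 3
    return score, score > threshold  # (similarity, is_redundant)
```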
Syntactic similarity identifies sentence pairs that share the same syntactic relationships. To find these relationships, 2-gram models are built for the sentence pair and the Dice coefficient is computed between them. If the Dice coefficient is non-zero, the two sentences are taken to have some syntactic relationship. The major concern here is that even when sentences are syntactically similar, their meanings may differ, so the semantic similarity between sentences also needs to be measured.
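A minimal sketch of the bigram Dice computation, assuming NLTK for tokenization, might look like this:

```python
from nltk import bigrams
from nltk.tokenize import word_tokenize

def dice_bigram_similarity(sent1, sent2):
    # Build the 2-gram sets for each sentence.
    b1 = set(bigrams(word_tokenize(sent1.lower())))
    b2 = set(bigrams(word_tokenize(sent2.lower())))
    if not b1 or not b2:
        return 0.0
    # Dice coefficient: 2 * |intersection| / (|b1| + |b2|).
    return 2 * len(b1 & b2) / (len(b1) + len(b2))
```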
The algorithm for finding the semantic similarity between sentences uses two main methods: WordNet-based semantic similarity and word2vec-based semantic similarity. The implementation details of each method are described below.
This approach computes the semantic similarity between sentences based on the synsets given by the WordNet lexical dictionary for each word in the sentences. It first assigns WordNet part-of-speech tags, i.e. noun, verb, adjective or adverb. WordNet synsets are then assigned to each tagged word in the sentence pair. The synsets are compared pair by pair and the path distance between them is computed. Finally, all the path distances are accumulated to obtain the WordNet-based semantic similarity between the sentences.
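A sketch of this WordNet-based scoring, assuming NLTK's POS tagger and WordNet interface (the tag mapping and the choice of the first matching synset are simplifications, not details from the report):

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn

def wn_pos(treebank_tag):
    # Map Penn Treebank tags to WordNet POS tags (noun, verb, adjective, adverb).
    if treebank_tag.startswith('N'): return wn.NOUN
    if treebank_tag.startswith('V'): return wn.VERB
    if treebank_tag.startswith('J'): return wn.ADJ
    if treebank_tag.startswith('R'): return wn.ADV
    return None

def tagged_synsets(sentence):
    # Assign a WordNet synset to each taggable word in the sentence.
    synsets = []
    for word, tag in pos_tag(word_tokenize(sentence)):
        pos = wn_pos(tag)
        if pos:
            syns = wn.synsets(word, pos=pos)
            if syns:
                synsets.append(syns[0])
    return synsets

def wordnet_similarity(sent1, sent2):
    syns1, syns2 = tagged_synsets(sent1), tagged_synsets(sent2)
    scores = []
    for s1 in syns1:
        # Best path similarity of this synset against the other sentence.
        best = max((s1.path_similarity(s2) or 0.0) for s2 in syns2) if syns2 else 0.0
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```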
For implementation purposes, we used the pre-trained Google News neural network, which was trained on a dataset of roughly 100 billion words. Based on this pre-trained network, word embeddings are obtained for each word in the sentence pair. For each pair of word embeddings, the cosine distance between them is calculated. Finally, all these cosine distances are accumulated to obtain the word2vec-based semantic similarity between the sentences.
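A sketch of this step using gensim, assuming the standard pre-trained `GoogleNews-vectors-negative300.bin` file is available locally; the aggregation below averages the pairwise cosine values to keep the score bounded, which is an assumption rather than a detail from the report:

```python
import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained Google News word2vec vectors (assumed local path).
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def word2vec_similarity(sent1, sent2):
    v1 = [w2v[w] for w in sent1.lower().split() if w in w2v]
    v2 = [w2v[w] for w in sent2.lower().split() if w in w2v]
    if not v1 or not v2:
        return 0.0
    scores = []
    for a in v1:
        for b in v2:
            # Cosine similarity between the two word embeddings.
            scores.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(scores))
```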
Finally, the overall semantic similarity is measured as the average of the two similarity scores obtained from the WordNet-based and word2vec-based methods. If this final semantic similarity score exceeds the threshold of 0.7, the sentences are considered semantically similar and the redundant sentence is removed.
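Building on the two sketches above, the combination step could look like:

```python
def semantic_similarity(sent1, sent2, threshold=0.7):
    # Average the WordNet-based and word2vec-based scores, then threshold.
    score = (wordnet_similarity(sent1, sent2) + word2vec_similarity(sent1, sent2)) / 2
    return score, score > threshold  # (similarity, is_redundant)
```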
After removing the redundant sentences from the intermediate summary, it is important to arrange the final summary sentences in a coherent order; otherwise readability suffers. Sentence ordering was therefore performed by sequence matching: a sequence ratio, which acts as a coherence score, is computed in both directions for a given sentence pair, and the sentences are arranged in the order that preserves the higher sequence ratio. The SequenceMatcher class from Python's difflib library was used for sequence matching. The final summary is then displayed to the user with the summary sentences arranged in this coherent order.
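A minimal sketch of this ordering step with difflib; the greedy ordering of a whole sentence list is an illustrative extension of the pairwise rule described above:

```python
from difflib import SequenceMatcher

def coherence(a, b):
    # Sequence ratio between two sentences, used as a coherence score.
    return SequenceMatcher(None, a, b).ratio()

def order_pair(sent1, sent2):
    # Keep the orientation that gives the higher sequence ratio.
    return (sent1, sent2) if coherence(sent1, sent2) >= coherence(sent2, sent1) else (sent2, sent1)

def order_sentences(sentences):
    # Greedy ordering: repeatedly append the remaining sentence that is
    # most coherent with the last sentence placed so far.
    ordered = [sentences[0]]
    remaining = list(sentences[1:])
    while remaining:
        nxt = max(remaining, key=lambda s: coherence(ordered[-1], s))
        ordered.append(nxt)
        remaining.remove(nxt)
    return ordered
```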
The implementation details of the hybrid recommendation module are described below, along with the implementation details of each individual recommendation method, i.e. content-based filtering, collaborative filtering and the popularity model.
A common (and usually hard-to-beat) baseline approach is the popularity model. This model is not actually personalized: it simply recommends to a user the most popular items that the user has not previously consumed. As popularity captures the "wisdom of the crowds", it usually provides recommendations that are interesting to most people. However, the main objective of a recommender system is to surface long-tail items to users with very specific interests, which goes far beyond this simple approach.
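A minimal sketch of such a popularity baseline, assuming an interactions table with illustrative `user_id`, `item_id` and `strength` columns (names not taken from the report):

```python
import pandas as pd

def popularity_recommend(interactions: pd.DataFrame, user_id, topn=10):
    # Rank items by total interaction strength across all users.
    popularity = (interactions.groupby('item_id')['strength']
                  .sum()
                  .sort_values(ascending=False))
    # Exclude items the user has already consumed.
    seen = set(interactions.loc[interactions['user_id'] == user_id, 'item_id'])
    return [item for item in popularity.index if item not in seen][:topn]
```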
Content-based filtering approaches leverage descriptions or attributes of the items a user has interacted with in order to recommend similar items. They depend only on the user's previous choices, which helps avoid the cold-start problem. It is simple to use the raw text to build item profiles and user profiles. Here we use TF-IDF, a very popular technique in information retrieval (search engines). This technique converts unstructured text into a vector representation in which each word corresponds to a position in the vector and the value measures how relevant that word is for an article. As all items are represented in the same vector space model, it is easy to compute the similarity between articles.
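A sketch of the item-profile step with scikit-learn's TfidfVectorizer; the vectorizer settings and placeholder articles are illustrative choices, not values from the report:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = ["first news article text ...", "second news article text ..."]  # placeholder corpus

vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
item_profiles = vectorizer.fit_transform(articles)   # items x vocabulary TF-IDF matrix
item_similarity = cosine_similarity(item_profiles)   # items x items similarity matrix
```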
To model the user profile, we take all the profiles of the news articles the user has interacted with and average them. The average is weighted by the interaction strength; in other words, the articles the user has interacted with most strongly (e.g. liked or commented on) have a higher weight in the final user profile.
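Building on the TF-IDF item profiles above, the weighted user profile could be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def build_user_profile(item_profiles, interacted_indices, strengths):
    # item_profiles: sparse TF-IDF matrix (items x vocabulary)
    # interacted_indices: row indices of the articles the user interacted with
    # strengths: interaction strength of each of those articles
    vectors = item_profiles[interacted_indices].toarray()
    weights = np.asarray(strengths, dtype=float).reshape(-1, 1)
    profile = (vectors * weights).sum(axis=0) / weights.sum()
    # Normalise so the profile can be scored against items with cosine similarity.
    return profile / (np.linalg.norm(profile) + 1e-10)
```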
- Memory-based: This approach uses the memory of previous user interactions to compute user similarities based on the items they have interacted with (user-based approach), or to compute item similarities based on the users who have interacted with them (item-based approach). A typical example is user-neighbourhood-based CF, in which the top-N most similar users are selected and used to recommend items those similar users liked but the current user has not yet interacted with.
- Model-based: In this approach, models are developed using different machine learning algorithms to recommend items to users. There are many model-based CF algorithms, such as neural networks, Bayesian networks, clustering models, and latent factor models such as Singular Value Decomposition (SVD) and probabilistic latent semantic analysis.
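As one possible sketch of the model-based branch, a truncated SVD of the user-item strength matrix can be used to reconstruct predicted strengths (the factor count below is an illustrative choice, not a value from the report):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def cf_predictions(interaction_matrix, n_factors=15):
    # interaction_matrix: users x items array of interaction strengths;
    # n_factors must be smaller than both dimensions.
    R = csr_matrix(interaction_matrix, dtype=float)
    U, sigma, Vt = svds(R, k=n_factors)
    predicted = U @ np.diag(sigma) @ Vt  # reconstructed user-item strengths
    # Rescale to [0, 1] so the scores can be combined with other models.
    return (predicted - predicted.min()) / (predicted.max() - predicted.min())
```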
In our approach, we combine collaborative filtering and content-based filtering algorithms to provide more accurate recommendations. In fact, hybrid methods have outperformed individual approaches in many studies and have been used extensively by researchers and practitioners. In our hybridization method, we simply multiply the CF score by the content-based score and rank the items by the resulting score.
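A minimal sketch of this multiplicative hybridisation, assuming per-user score dictionaries produced by the CF and content-based models (the dictionary layout is illustrative):

```python
def hybrid_recommend(cf_scores, cb_scores, topn=10):
    # cf_scores, cb_scores: dicts mapping item_id -> score for one user.
    # Multiply the two scores per item and rank by the product.
    hybrid = {item: cf_scores.get(item, 0.0) * cb_scores.get(item, 0.0)
              for item in set(cf_scores) | set(cb_scores)}
    return sorted(hybrid.items(), key=lambda kv: kv[1], reverse=True)[:topn]
```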
This chapter provides an overview of the implementation of the project. We have stated how the project was built step by step and the approaches we followed to accomplish each step. The module-wise implementations are described further in this chapter.
This chapter focuses on the results obtained by implementing the algorithms we have proposed. The results obtained through experimentation are summarized and analyzed through discussions. These discussions are used to draw conclusions about the algorithms and approaches we used to achieve the aims and objectives.
We used the BBC news dataset. It consists of 2,225 e-news articles from the BBC e-news website, covering five main areas from 2004 to 2005. The class labels are politics, business, entertainment, technology and sport. The dataset contains 510 business, 386 entertainment, 414 political, 511 sports and 401 technology articles.
The classification results on the test data are summarised in the following confusion matrix (rows are the original labels, columns the predicted labels):

| Original \ Predicted | Business | Entertainment | Political | Sport | Tech | Total |
|---|---|---|---|---|---|---|
| Business | 110 | 1 | 3 | 0 | 1 | 115 |
| Entertainment | 0 | 71 | 1 | 0 | 0 | 72 |
| Political | 2 | 0 | 73 | 0 | 1 | 76 |
| Sport | 1 | 0 | 0 | 101 | 0 | 102 |
| Tech | 1 | 1 | 0 | 0 | 78 | 80 |
To select the best kernel function for this domain, we calculated the accuracy of the system with different kernel functions. The table below shows the results for each kernel function. Kernel functions are applied to non-linearly separable domains to map them into higher-dimensional spaces in which the classes can be more easily separated.
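A sketch of such a kernel comparison with scikit-learn, assuming the article texts and labels are already loaded into `texts` and `labels` (names and split parameters are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# texts, labels: the BBC article bodies and their class labels (assumed loaded).
X = TfidfVectorizer(stop_words='english').fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Score the same split with each kernel and compare accuracies.
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, accuracy_score(y_test, clf.predict(X_test)))
```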
RandomForestClassifier, MultinomialNB and the Support Vector Machine gave the highest accuracies, so we used those three classifiers to create an ensemble classifier. The hard voting classifier uses majority-rule voting, whereas the soft voting method predicts the e-news class label based on the sum of the predicted probabilities of the individual classifiers. The soft voting method gave more accurate results than the hard voting method, so we used soft voting to implement our system.
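A sketch of this ensemble with scikit-learn's VotingClassifier in soft-voting mode, reusing the train/test split from the kernel comparison sketch above (hyperparameters are illustrative; SVC needs probability=True so it can contribute class probabilities to the vote):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100)),
        ('nb', MultinomialNB()),
        ('svm', SVC(kernel='linear', probability=True)),
    ],
    voting='soft',  # average predicted probabilities instead of majority vote
)
ensemble.fit(X_train, y_train)
predicted = ensemble.predict(X_test)
```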
The final results show higher average recall, precision and F1 scores for the ensemble classifier than for the other three individual classifiers.