This chapter describes the implementation carried out according to the analysis and design presented in the previous chapters. It explains how technology is used to realise the solution and gives step-by-step implementation details for each module stated in the design section.
The implementation is done using the Python programming language. In the extraction module, e-news articles are extracted from different e-news web portals. The system first needs the URLs or RSS feeds of a set of predefined websites, so as a first step the seed URLs and RSS feeds are given to the system as a JSON file. The format of the JSON file is shown below; this format makes it easy to add or remove websites.
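A minimal sketch of such a seed file is given below. The site names, URLs, and field names ("link", "rss") are illustrative assumptions rather than the exact format used by the system.

{
  "bbc": {
    "link": "https://www.bbc.com/news",
    "rss": "http://feeds.bbci.co.uk/news/rss.xml"
  },
  "example_news": {
    "link": "https://www.example.com/news"
  }
}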
The system iterates through the JSON file and checks whether an RSS feed is provided for each site. If one is available, the FeedParser library is used to load the RSS feed, and the structure for the data is built by constructing a dictionary named newsPaper.
The variable d holds the list of entries taken from the RSS feed, and the system loops through each entry. The publish date field is checked to keep the data consistent; if it is not available, the entry is discarded. An article dictionary is created to store the data for every e-news item, as sketched below.
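The following sketch illustrates this extraction loop, assuming the seed file format above and the feedparser package; the file name seed_sites.json and the exact fields stored for each article are assumptions.

import json
import feedparser

with open('seed_sites.json') as f:                  # hypothetical seed file name
    sites = json.load(f)

data = {}
for source, value in sites.items():
    if 'rss' in value:                              # an RSS feed is provided for this site
        d = feedparser.parse(value['rss'])          # d holds the entries of the RSS feed
        newsPaper = {'rss': value['rss'], 'link': value.get('link'), 'articles': []}
        for entry in d.entries:
            if not hasattr(entry, 'published'):     # discard entries without a publish date
                continue
            article = {'title': entry.title,        # data stored per e-news item (assumed fields)
                       'link': entry.link,
                       'published': entry.published}
            newsPaper['articles'].append(article)
        data[source] = newsPaper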
The implementation of e-news aggregation can be divided into three parts: preprocessing, feature extraction, and clustering. The Python nltk, sklearn, and gensim packages were used to implement the e-news aggregation module.
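The exact preprocessing steps are not listed here; the sketch below shows one plausible pipeline using nltk (tokenisation, stop-word removal, and lemmatisation).

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# requires nltk.download('punkt'), nltk.download('stopwords') and nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                 # tokenise and lower-case
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]      # drop punctuation and stop words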
For feature extraction, three different feature models were implemented: an LDA model, a Doc2vec model, and a TF-IDF model. The TF-IDF feature model gave higher accuracy than the other two. The LDA model was implemented using the gensim Python library.
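A sketch of two of these feature models is shown below; the placeholder article texts, the number of topics, and the other parameters are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from gensim import corpora, models

articles = ["text of article one ...", "text of article two ..."]   # placeholder article texts

# TF-IDF feature model (the model that gave the highest accuracy)
tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(articles)

# LDA feature model implemented with gensim
tokenised = [a.lower().split() for a in articles]
dictionary = corpora.Dictionary(tokenised)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenised]
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary)    # num_topics is an assumption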
For clustering, three different clustering algorithms were implemented: K-means, Affinity Propagation, and DBSCAN. The DBSCAN algorithm gave higher accuracy than the other two.
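A clustering sketch using the DBSCAN implementation from sklearn is shown below; eps and min_samples are illustrative values, and the cosine metric is an assumption for text vectors. It continues from the TF-IDF matrix in the previous sketch.

from sklearn.cluster import DBSCAN

# cluster the TF-IDF article vectors; articles labelled -1 are treated as noise by DBSCAN
labels = DBSCAN(eps=0.7, min_samples=2, metric='cosine').fit_predict(tfidf_matrix)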
In the summary generation process, the most salient sentences are extracted by assigning a score to every sentence of the input documents. A hybrid approach is applied to assign the sentence scores: a feature-based approach and a graph-based approach are combined, as described in the previous chapters.
The TextRank algorithm, one of the well-known graph-based approaches, was implemented here. It first constructs a sentence similarity graph from the pairwise sentence similarities. After evaluating the similarity measures described in earlier chapters, the cosine similarity measure was chosen for the implementation since it showed the highest accuracy. Before computing cosine similarities between sentences, sentence vectors must be generated; they are generated by computing feature weights (e.g. term frequency).
The cosine similarity is then calculated for each pair of sentence vectors, giving a measure of how similar each sentence in the document is to the others. The cosine similarity is 1 if two sentences are identical, -1 if they are exactly opposite, and a value between -1 and 1 in all other cases. Equivalently, two sentences are similar when the cosine distance is close to 0: the smaller the cosine distance, the higher the similarity, since cosine similarity equals (1 - cosine distance).
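As a small worked example of this relationship, using toy term-frequency vectors:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

v1 = np.array([[1, 2, 0, 1]])                   # term-frequency vector of sentence 1 (toy values)
v2 = np.array([[1, 1, 1, 0]])                   # term-frequency vector of sentence 2 (toy values)

similarity = cosine_similarity(v1, v2)[0, 0]    # ~0.71, so the sentences are fairly similar
distance = 1 - similarity                       # cosine distance = 1 - cosine similarity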
A sentence similarity matrix is then generated from these pairwise similarities and converted into a graph so that the PageRank algorithm can be applied to assign sentence scores. When building the sentence similarity matrix, the values are normalized with TF-IDF to reduce the noise introduced by stop words.
The final result of this approach is a PageRank score for each sentence in the original set of documents. The built-in pagerank function from the networkx library was used to compute these scores.
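A compact sketch of this graph-based scoring is shown below; it assumes networkx 2.x (from_numpy_array) and TF-IDF-weighted sentence vectors as described above.

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_scores(sentences):
    sentence_vectors = TfidfVectorizer().fit_transform(sentences)   # TF-IDF-weighted sentence vectors
    sim_matrix = cosine_similarity(sentence_vectors)                # pairwise cosine similarities
    graph = nx.from_numpy_array(sim_matrix)                         # similarity matrix -> weighted graph
    return nx.pagerank(graph)                                       # {sentence index: PageRank score}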
The feature-based method assigns each sentence a weighted average score based on the presence of a set of predefined features. These features include the sentence position, sentence length, presence of title words, named entity count, noun count, verb count, numerical literal count, keyword frequencies, etc.
The sentence position feature assigns each sentence a score based on its position in the original document. It assigns high scores to the introductory sentences at the beginning of the document and to the concluding sentences at the end, as sketched below.
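The exact position weighting is not specified here; the sketch below is one simple way to realise it, scoring sentences higher the closer they are to the beginning or the end of the document.

def position_score(index, n_sentences):
    if n_sentences <= 1:
        return 1.0
    relative = index / (n_sentences - 1)        # 0 for the first sentence, 1 for the last
    return max(1 - relative, relative)          # high at both ends, lowest (0.5) in the middle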
A score is also assigned to each sentence based on its number of numerical literals, nouns, verbs, and named entities. These feature scores are normalized with a sigmoid function so that they range from 0 to 1 and contribute evenly to the final score.
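For example, a raw count feature can be squashed into the 0-1 range with a sigmoid, as sketched below; using the raw count directly as the sigmoid input is an illustrative choice.

import math

def count_feature_score(count):
    # squash a raw count (e.g. the number of named entities in the sentence) into (0, 1)
    return 1 / (1 + math.exp(-count))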
The thematic word feature assigns sentence scores based on term frequencies, i.e. the counts of each word in the sentence, and thereby identifies the most frequent words, which are treated as the keywords of the text. Before identifying the keywords, the stop words, which carry no semantic meaning, must be removed, since they appear very frequently in the documents and should not be identified as keywords.
Based on the individual scores obtained for each feature, a final aggregated score is then calculated for each sentence. Weights are assigned to the features in the feature vector according to their relative importance for generating the summary. The title score and the keyword frequencies have proven to have the highest importance, so those features are assigned higher weights. The final sentence score is calculated as a weighted average, i.e. the sum of the products of the individual feature scores and their weights divided by the sum of the weights, and these final scores for each sentence in the original documents are returned by the function, as sketched below.
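A sketch of this weighted aggregation is shown below; the feature names and weight values are illustrative, with title words and keyword frequencies weighted more heavily as stated above.

# illustrative weights; title words and keyword frequencies receive the higher weights
WEIGHTS = {'position': 1.0, 'length': 1.0, 'title': 2.0, 'keywords': 2.0,
           'named_entities': 1.0, 'nouns': 1.0, 'verbs': 1.0, 'numerals': 1.0}

def feature_based_score(features):
    # features: normalised (0-1) scores keyed by the names in WEIGHTS
    weighted_sum = sum(WEIGHTS[name] * score for name, score in features.items())
    return weighted_sum / sum(WEIGHTS.values())      # weighted average score for the sentence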
The final sentence scores are then calculated as the average of the scores obtained from the two approaches above. The system selects the top-ranked sentences to form an individual summary for each news article in the cluster. The number of selected sentences depends on a compression rate, which is usually set to 30% of the original text. Finally, those individual summaries are compiled together to form an intermediate-level summary.
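The final selection step can be sketched as follows, assuming one graph-based (TextRank) score and one feature-based score per sentence and a 30% compression rate.

def summarise(sentences, graph_scores, feature_scores, compression=0.3):
    # final score of a sentence = average of its graph-based and feature-based scores
    final = [(g + f) / 2 for g, f in zip(graph_scores, feature_scores)]
    n_select = max(1, int(len(sentences) * compression))            # keep ~30% of the sentences
    top = sorted(range(len(sentences)), key=lambda i: final[i], reverse=True)[:n_select]
    return [sentences[i] for i in sorted(top)]                      # keep the original sentence order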
After the intermediate summary is created by combining the individual summaries, some post-processing tasks need to be applied to make the final summary more readable and coherent. Redundancy removal is one of the most important of these tasks. Since the top-ranked sentences from each e-news article in the cluster are aggregated to form the final summary, the summary may contain redundant sentences, because different e-news articles may use different words and phrases to describe the same thing. Redundant sentences are removed by identifying similar sentences from three perspectives: syntactic similarity, lexical similarity, and semantic similarity.