Intelligent Human Computer Interaction-Artificial Intelligence

An Intelligent Automatic Text Summarizer


Abstract

This paper presents an intelligent text summarizer that condenses a given piece of text into three different summaries, each produced by a different algorithm. The summarizer relies on statistical methods, such as word frequency and the presence of rare words. It then gives a short, meaningful title to the main text and finally selects the best summary from the three candidates. The summarizer also assigns the writer a competence level (in written English) after analyzing features of the text such as the number of rare words used. Results obtained through experiments showed that it is indeed possible to determine the competence level of the writer from the text, and that the proximity of the sentences plays a vital role in selecting the best summary.

1 Introduction

A summary is a text that is produced from one or more texts, that contains a significant portion of the information in the original text, and that is no longer than half the original text. Summarization is used extensively in generating search engine query results and automated abstracts of research papers. It also plays an important role in categorizing the ever-growing collection of pages on the web today. Title extraction is an important and challenging task for any summarizer. The most difficult part is choosing which sentences are important enough to be selected for the final summary. This summarizer has been developed with the aim that no information is lost during the summarization phase.

2 Related Work

This work is part of the ongoing research called The Thinking Algorithm, where the indexer does text summarization. There are two ways to view text summarization: as text extraction or as text abstraction. This summarizer draws much inspiration from H.P. Luhn’s work on text summarization. What differentiates this work from the others is that most of the works cited in this paper generate a single summary, which is claimed to be the best one.

This summarizer generates three different summaries. Moreover, the algorithms and heuristics that select the final summary take into account the proximity of the sentences in the main text compared to their proximity in the generated summary. There are three steps in text summarization: first, understanding the topic of the text; second, interpreting the text; and finally, generating the summary text. The generation of text is carried out in two different ways, namely extraction and abstraction.

3 Implementation Details

Five people were given the task of drafting an essay (of about 500 words) on a topic chosen unanimously. The writers were not equally competent in written English. The following questions were addressed during implementation:

  • Why does one person use more rare words in a text than another?
  • Why can one person express several ideas in one long sentence while another uses several sentences to convey the same idea?
  • How can the competency of the writer be judged from the essay?
  • Why has the writer chosen one particular word in a sentence though other words of the same meaning exist?
  • How can a title be given to each text using an automatic text summarizer?
  • How important are the proper nouns in the text?
  • How can the automatic text summarizer determine which sentences are important and need to be selected?

4 Observations and Algorithm Design

Observation 1: The writer with an excellent command of English wrote long sentences with an appreciable number of high-frequency words.
Observation 2: Two writers wrote average English essays, with less use of high-frequency words and a mixture of long and short sentences.
Observation 3: The remaining two writers were less competent in written English. Their essays contained an appreciable number of grammatical mistakes, and their sentences were very short.

It became clear that high-frequency words play an important role in any writing because they convey some message from the writer. A writer uses high-frequency words to express an important message. For example, one of the writers wrote:
“Students must never infringe the rubric of the school.” Automatically, this becomes an important line because it conveys an important rule that the students should always follow. “infringe” and “rubric” were rare words in the whole essay.
Avoiding loss of information from the main text was also considered important. It would be better if all the sentences of a similar category were grouped together. A selection algorithm could then choose the best group based on some rules.
Each word should be allotted a weight based on its occurrence in the main text. The weight can be set equal to the word’s frequency, i.e. how many times the word occurs in the text. After several evaluations and experiments, the following three extraction algorithms and one selection algorithm were chosen.

1. Extracting and grouping those sentences that have rare words in them.

2. Extracting those sentences whose sum of frequencies of words in that sentence is below the average value calculated from the entire sum of frequencies of words in the whole document.

3. Extracting those sentences whose sum of frequencies of words in that sentence is above the average value calculated from the entire sum of frequencies of words in the whole document.

4. A selection algorithm that selects the best summary out of the three generated summaries.

4.1 Explanation and Proof of Concept

The sentences with rare words were categorized into one group. However, a problem with this algorithm was observed when it summarized the essay of the writer with an excellent command of written English. The algorithm selected more than 90% of the lines of the essay. This violated the definition of a summary, which should be no longer than half the original text.
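
The rare-word grouping and the length problem described above can be illustrated with a minimal Python sketch. The sentence splitter, the tokenizer, and the function name `rare_word_summary` are assumptions made for illustration; the source does not describe how the text was split or tokenized. The returned ratio makes it easy to check a candidate against the "no longer than half the original" rule.

```python
import re
from collections import Counter

def split_sentences(text):
    # Assumed splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(text):
    # Assumed tokenizer: lowercase words, keeping inner apostrophes and hyphens.
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def rare_word_summary(text):
    """Return the sentences containing a rare word and the fraction selected."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))
    # A rare word occurs exactly once in the whole document.
    rare = {word for word, count in freq.items() if count == 1}
    selected = [s for s in sentences if any(w in rare for w in tokenize(s))]
    ratio = len(selected) / len(sentences) if sentences else 0.0
    return selected, ratio

text = "The cat sat. The cat sat. The dog barked loudly."
selected, ratio = rare_word_summary(text)
too_long = ratio > 0.5  # would violate the half-length definition of a summary
```

In this toy input only the last sentence contains words that occur once, so one sentence out of three is selected. A text written with a rich vocabulary would push the ratio toward 1, which is the failure mode reported above.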

Algorithms (2) and (3) stated above work as follows:

Note: Rare words are those words that occur only once in the entire document.

Let S be a sentence in the main text, containing N words with frequencies f1, f2, f3, …, fN. Each frequency is the total number of occurrences of that word in the whole document. The sum of the frequencies (SF) for the sentence is then SF = f1 + f2 + f3 + … + fN. The average frequency (AvgF) is calculated from the formula given in Step 2 below. If SF < AvgF, the sentence is selected by algorithm (2); if SF > AvgF, it is selected by algorithm (3).

The following algorithm was devised:

Step 1: Calculate frequencies of the words present in the main text.

Step 2: Find the sum of all the frequencies and hence calculate the average frequency (AvgF) of the whole document:

Average Frequency (AvgF) = (Sum of frequencies of individual words) / (Total number of words)

Step 3: Store the rare words in a separate file. Rare words are those words with a frequency of one.

Step 4: Find the sum of the word frequencies for each sentence, compare the sum with the value obtained from Step 2, and do the following:

a. If the sum of the frequencies of the words in the sentence is greater than the value obtained from Step 2, store that sentence in one file.

b. If the sum of the frequencies of the words in the sentence is less than the value obtained from Step 2, store that sentence in another file.
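
Steps 1 to 4 can be sketched in Python as follows. This is one literal reading of the steps, not the original implementation: it assumes "total number of words" means word occurrences (tokens) rather than distinct words, it returns lists instead of writing files, and the helper names are hypothetical. A sentence whose sum equals AvgF falls into neither group, since the steps only mention "greater than" and "less than".

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

def split_sentences(text):
    # Assumed splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def partition_sentences(text):
    """Return (AvgF, rare words, above-average sentences, below-average sentences)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0, [], [], []
    freq = Counter(tokens)                                    # Step 1
    # Step 2 (assumption: averaged over every word occurrence).
    avg_f = sum(freq[t] for t in tokens) / len(tokens)
    rare_words = sorted(w for w, c in freq.items() if c == 1)  # Step 3
    above, below = [], []
    for sentence in split_sentences(text):                    # Step 4
        sf = sum(freq[t] for t in tokenize(sentence))         # SF = f1 + ... + fN
        if sf > avg_f:
            above.append(sentence)                            # Step 4a
        elif sf < avg_f:
            below.append(sentence)                            # Step 4b
    return avg_f, rare_words, above, below

avg_f, rare_words, above, below = partition_sentences("Go go go go. Stop.")
```

Under this reading SF is a sum over a whole sentence while AvgF is a per-word figure, so long sentences almost always land in the above-average group. The source does not say whether any further normalization was applied.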

4.1.1 Intuitive Justification

The summarizer must not discard the sentences from the essay but categorize them so that in the end, there is no loss of information. There are three algorithms in this summarizer that act as “selectors” for the sentences. If one algorithm misses a sentence, then the other one stores it. This ensures that none of the sentences is left out and we have categorized results. The selection algorithm then selects the most viable summary out of the three generated summaries.

5 The Selection Algorithm

The selection algorithm selects the best-fit summary from the list of generated summaries. Its design revolves around scanning the main text and the generated summaries. The key points taken into account while designing the selection algorithm are:

1. The title generated by the title generator should be present in the summary.

2. The summary should be neither very long nor very short. The optimal length that this algorithm considers is one third of the original paragraph.

3. The separation between the sentences is also very important. The design takes into account that the sentences selected in a summary must not be far from each other in the main text, i.e. proximity consideration.
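
The three criteria above can be combined into a simple score, sketched here in Python. The source does not say how the criteria are measured or weighted, so everything numeric in this sketch is an assumption: length is counted in sentences, proximity is the inverse of the mean gap between selected sentence positions, and the three parts are added with equal weight. The names `score_summary` and `select_best` are hypothetical.

```python
def score_summary(indices, sentences, title_words):
    """Score a candidate summary given as sorted sentence indices (higher is better)."""
    if not indices:
        return 0.0
    summary_text = " ".join(sentences[i] for i in indices).lower()

    # Criterion 1: share of title words found in the summary (naive substring match).
    hits = sum(1 for w in title_words if w.lower() in summary_text)
    title_score = hits / len(title_words) if title_words else 0.0

    # Criterion 2: 1.0 at the one-third target, decreasing linearly with distance from it.
    ratio = len(indices) / len(sentences)
    length_score = max(0.0, 1.0 - abs(ratio - 1 / 3) / (2 / 3))

    # Criterion 3: inverse of the mean gap between consecutive selected sentences.
    gaps = [b - a for a, b in zip(indices, indices[1:])]
    proximity_score = len(gaps) / sum(gaps) if gaps else 1.0

    # Assumption: the three criteria are simply added with equal weight.
    return title_score + length_score + proximity_score

def select_best(candidates, sentences, title_words):
    # candidates maps a summary name to its list of sentence indices.
    return max(candidates, key=lambda name: score_summary(candidates[name], sentences, title_words))

sentences = [
    "The school opened early.", "The school has strict rules.",
    "Students arrived late.", "Teachers were waiting.",
    "Lunch was served.", "Everyone went home.",
]
candidates = {"rare": [0, 1], "below": [2, 5], "above": [0, 1, 2, 3, 4]}
best = select_best(candidates, sentences, ["school"])
```

In this toy input the "rare" candidate wins because it contains the title word, covers exactly one third of the sentences, and its sentences are adjacent in the main text.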

6 Title Generator

Results obtained after testing with ten documents suggest that the title generator was able to produce a meaningful title for just three of the documents; for the rest, the titles were not of an acceptable standard. The following key points were kept in mind while designing the title generator.

1. If there is a date that occurs the greatest number of times in the document, the title generator selects it as the title.

2. If there is a proper noun, such as the name of a person or a place, that occurs the greatest number of times in the document, the title generator selects it as the best title.

3. If neither of the above applies, the title generator takes into account the proper noun or date that occurs at the beginning of the document.

4. If there is no proper noun or date in the document, the title generator considers the words that occur the greatest number of times in the document and that are very close to each other at each occurrence.

5. If none of the above works, the title generator picks the word or words that occur the greatest number of times in the document and that lie at the very top of the document.
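
The fallback order of these rules can be sketched in Python. This is a simplified illustration under several assumptions that the source does not confirm: a "date" is taken to be a four-digit year, a "proper noun" is a capitalized word that does not start a sentence, "occurs the greatest number of times" is read as "occurs more than once", and the stopword list is hypothetical. Rule 4 (frequent words that stay close to each other) is left out, and the sketch returns a single token, whereas the real generator can evidently produce multi-word titles.

```python
import re
from collections import Counter

YEAR = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")      # assumption: a date is a 4-digit year
PROPER = re.compile(r"(?<=[a-z,;] )[A-Z][a-z]+")  # assumption: capitalized, not sentence-initial
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "it", "by"}  # hypothetical list

def most_repeated(items):
    # Return the most frequent item if it occurs more than once, else None.
    if not items:
        return None
    item, count = Counter(items).most_common(1)[0]
    return item if count > 1 else None

def generate_title(text):
    dates = YEAR.findall(text)
    nouns = PROPER.findall(text)
    title = most_repeated(dates)                  # rule 1: repeated date
    if title is None:
        title = most_repeated(nouns)              # rule 2: repeated proper noun
    if title is None and (dates or nouns):        # rule 3: earliest date or proper noun
        matches = [m for m in (YEAR.search(text), PROPER.search(text)) if m]
        title = min(matches, key=lambda m: m.start()).group()
    if title is None:                             # rule 5: most frequent remaining word
        words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
        title = Counter(words).most_common(1)[0][0] if words else ""
    return title

title_date = generate_title("The plant opened in 1907. By 1907 the town had grown.")
title_noun = generate_title("We visited Paris in spring. Later we left Paris for good.")
title_first = generate_title("We met in 1999 near Rome.")
title_word = generate_title("the cat chased the cat toy")
```

Each call exercises one branch of the chain: a repeated year, a repeated proper noun, the earliest date or proper noun when nothing repeats, and the most frequent ordinary word when neither is present.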

7 Results

7.1 The main paragraph

Established in 1907 by Jamshedji Nusserwanji Tata, Tata Steel is Asia’s first and India’s largest integrated private sector steel company. It is one of the few select steel companies that is EVA+ (Economic Value Added). Over the years, Tata Steel has emerged as a thriving, nimble, steel enterprise, due to its ability to transform itself rapidly to meet the challenges of a highly competitive global economy and commitment to become a supplier of choice by delighting its customers with services and products. Constant modernisation and introduction of state-of-the-art technology at Tata Steel has enabled it to stay ahead in the industry and successfully meet the expectations of all sections of stakeholders. Tata Steel’s four-phase Modernisation Programme in the steel works has enabled it to acquire the most modern steel making facilities in the world. Its captive raw material resources and the state-of-the-art 5 MTPA (million tonne per annum) plant at Jamshedpur, in Jharkhand State, India gives it a competitive edge. Determined to be a major global steel player, Tata Steel has recently included in its fold NatSteel, Asia (2 MTPA) and Millennium Steel (now Tata Steel Thailand) creating a manufacturing network in eight markets in South East Asia and Pacific rim countries. Soon the Jamshedpur plant will expand its capacity from 5 MTPA to 7 MTPA by 2008. The Company plans to enhance its capacity, manifold through organic growth and investments. The Company’s wire manufacturing unit in Sri Lanka is known as Lanka Special Steel, while the joint venture in Thailand for limestone mining is known as Sila Eastern. Its fifth phase of the Modernisation Programme leverages the intellectual capabilities of its employees to generate sustainable value for the stakeholders. Tata Steel is taking better Knowledge Management initiatives to shift focus from creating new physical assets to utilising them with ingenuity and a sturdy business sense.

7.2 Rare words display

Established in 1907 by Jamshedji Nusserwanji Tata, Tata Steel is Asia’s first and India’s largest integrated private sector steel company. It is one of the few select steel companies that is EVA+ (Economic Value Added). Over the years, Tata Steel has emerged as a thriving, nimble, steel enterprise, due to its ability to transform itself rapidly to meet the challenges of a highly competitive global economy and commitment to become a supplier of choice by delighting its customers with services and products. Constant modernisation and introduction of state-of-the-art technology at Tata Steel has enabled it to stay ahead in the industry and successfully meet the expectations of all sections of stakeholders. Tata Steel’s four-phase Modernisation Programme in the steel works has enabled it to acquire the most modern steel making facilities in the world. Its captive raw material resources and the state-of-the-art 5 MTPA (million tonne per annum) plant at Jamshedpur, in Jharkhand State, India gives it a competitive edge. Determined to be a major global steel player, Tata Steel has recently included in its fold NatSteel, Asia (2 MTPA) and Millennium Steel (now Tata Steel Thailand) creating a manufacturing network in eight markets in South East Asia and Pacific rim countries.

7.3 Below average

Its captive raw material resources and the state-of-the-art 5 MTPA (million tonne per annum) plant at Jamshedpur, in Jharkhand State, India gives it a competitive edge. Soon the Jamshedpur plant will expand its capacity from 5 MTPA to 7 MTPA by 2008. The Company plans to enhance its capacity, manifold through organic growth and investments. Its fifth phase of the Modernisation Programme leverages the intellectual capabilities of its employees to generate sustainable value for the stakeholders.

7.4 Above average

Established in 1907 by Jamshedji Nusserwanji Tata, Tata Steel is Asia’s first and India’s largest integrated private sector steel company. It is one of the few select steel companies that is EVA+ (Economic Value Added). Over the years, Tata Steel has emerged as a thriving, nimble, steel enterprise, due to its ability to transform itself rapidly to meet the challenges of a highly competitive global economy and commitment to become a supplier of choice by delighting its customers with services and products. Constant modernisation and introduction of state-of-the-art technology at Tata Steel has enabled it to stay ahead in the industry and successfully meet the expectations of all sections of stakeholders. Tata Steel’s four-phase Modernisation Programme in the steel works has enabled it to acquire the most modern steel making facilities in the world. Determined to be a major global steel player, Tata Steel has recently included in its fold NatSteel, Asia (2 MTPA) and Millennium Steel (now Tata Steel Thailand) creating a manufacturing network in eight markets in South East Asia and Pacific rim countries. The Company’s wire manufacturing unit in Sri Lanka is known as Lanka Special Steel, while the joint venture in Thailand for limestone mining is known as Sila Eastern. Tata Steel is taking better Knowledge Management initiatives to shift focus from creating new physical assets to utilising them with ingenuity and a sturdy business sense.

7.5 Title

Established 1907 Jamshedji Nusserwanji Tata

7.6 The Final Selected Summary

The summary with rare words.

8 Conclusion

The algorithms illustrated can be regarded as among the simplest text summarization algorithms. However, the results they produced are worth noting. The main aim of the research was to understand what humans think when choosing a particular word while framing sentences. One important focus was to summarize small pieces of text rather than larger ones such as books. The core of this summarization system is the selection algorithm, which finally gives the user the best-summarized content. An automatic text summarizer has to satisfy the reader in the end, and that remains the main goal.

Author: Nitin Singh is part of this research. The paper is published on SpringerLink.

http://www.springerlink.com/content/g7n2850444341113/
