SweSum - A Text Summarizer for Swedish

Hercules Dalianis
NADA-KTH,
SE-100 44 Stockholm, Sweden
phone: +46 8 790 91 05
mobile: +46 70 568 13 59
fax: +46 8 10 24 77
email: hercules@nada.kth.se

August 2000

Abstract

This paper describes the state-of-the-art techniques in the area of automatic text summarization and how they are applied in the first text summarizer for Swedish, SweSum. SweSum is built on statistical and linguistic as well as heuristic methods. SweSum uses a dictionary of 700,000 word entries which tells whether a word belongs to an open word class and also gives the stem of the word. SweSum has been evaluated and its performance is estimated to be as good as that of the state-of-the-art techniques for English, i.e. on average a 30% summary (compression) of a 2-3 page news text gives a good summary.

1. Background

Text summarization, or automatic text summarization, is the technique of automatically creating an abstract or summary of a text. The technique has been under development for many years (Luhn 1958, Edmundson 1969 and Salton 1989), but in recent years, with the increased use of the Internet, there has been an awakening interest in summarization techniques.

According to Hovy and Lin (1997) there are two ways to view text summarization: as text extraction or as text abstraction. Text extraction means extracting pieces of an original text, on a statistical basis or with heuristic methods, and putting them together into a new, shorter text with the same information content. Text abstraction means parsing the original text linguistically, interpreting it, finding new concepts to describe it, and then generating a new, shorter text with the same information content.

The parsing and interpretation of text is an old research area which has been investigated for many years. Here we have a spectrum of techniques and methods, ranging from word-by-word parsing to rhetorical discourse parsing, as well as more statistical methods, or a mixture of all of these.

In (Dalianis & Hovy, 1996) a method called lexical aggregation is described, where two concepts in a text can be replaced by another concept to make the text shorter and sometimes less redundant; for example, the concepts selling and buying become business. Another method, called syntactic aggregation, is also described in (Dalianis & Hovy, 1996), where coordination is performed to make a text less redundant; for example, Mary walks and John walks becomes Mary and John walk. Lexical aggregation is related to the so-called concept fusion described in (Hovy & Lin, 1997).

Text summarization can be divided into three steps.

First, understanding the topic of the text, so-called topic identification (Hovy & Lin, 1997; Lin & Hovy, 1997); secondly, the interpretation of the text; and finally the generation of the new text. The generation of text can, as previously mentioned, be carried out in two different ways, namely extraction and abstraction. Abstraction must make use of a natural language generator, as for example (Dalianis, 1999).

Topic identification is also used in information retrieval when one wants to find keywords for categorizing a text, for example in a library.

There are many methods for performing topic identification, see (Lin & Hovy, 1997): word counting at the concept level, which is more advanced than simple word counting, and identification of cue phrases that signal the topic.

Another method to identify topics is to perform rhetorical parsing and build an RST tree, where the nuclei are identified as the topics (Marcu, 1997). Multi-document text summarization has been investigated in (McKeown & Radev, 1995).

2. Methods

Lin (1999) describes a set of summarization methods and algorithms based on extraction:

Baseline: The order of the sentences in the text gives their importance: the first sentence gets the highest ranking, the last sentence the lowest.

Title: Sentences containing words that also occur in the title are given a high score.

Term frequency (tf): Open-class terms which are frequent in the text are more important than less frequent ones. Open word classes are those that admit new words over time, i.e. content words such as nouns, verbs and adjectives.

Position score: The assumption is that certain genres put important sentences in fixed positions. For example, newspaper articles have their most important terms in the first four paragraphs.

Query signature: The user's query affects the summary in that the extract will contain the query words.

Sentence length: The length of a sentence indicates how important it is.

Average lexical connectivity: The number of terms shared with other sentences. The assumption is that a sentence which shares more terms with other sentences is more important.

Numerical data: Sentences containing numerical data obtain the Boolean value 1, i.e. are scored higher than sentences without numerical values.

Proper name: Ditto for proper names in sentences.

Pronoun and adjective: Ditto for pronouns and adjectives in sentences; pronouns reflect coreference connectivity.

Weekdays and months: Ditto for weekdays and months in sentences.

Quotation: Sentences containing quotations might be important for certain questions from the user.

First sentence: The first sentence of each paragraph is taken to be its most important sentence.

Decision tree combination function: All the above parameters are put into a decision tree and trained on a set of texts and their manually created summaries.

Simple combination function: All the above parameters are normalized and put into a combination function with no special weighting; a minimal sketch of such a function is given after this list.
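
As an illustration, here is a minimal sketch of such a simple combination function in Perl (the language SweSum itself is written in). The feature names and the max-normalization step are assumptions made for the example, not Lin's actual implementation:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Naive combination: normalize every feature score to [0,1] over all
    # sentences, then sum the normalized scores with equal weight.
    sub naive_combination {
        my @sent = @_;    # one hashref per sentence: feature name => raw score
        my @features = keys %{ $sent[0] };
        for my $f (@features) {
            my ($max) = sort { $b <=> $a } map { $_->{$f} } @sent;
            next unless $max;                 # skip features that are all zero
            $_->{$f} /= $max for @sent;       # normalize to [0,1]
        }
        # total score per sentence: plain unweighted sum
        return map { my $s = $_; my $sum = 0; $sum += $s->{$_} for @features; $sum } @sent;
    }

Called as naive_combination({ baseline => 1.0, tf => 3 }, { baseline => 0.5, tf => 7 }), it returns one total score per sentence, in text order.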

3. The methods used in SweSum

The domain of SweSum is Swedish HTML-tagged newspaper text. SweSum ignores the HTML tags which control the format of the page but processes the HTML tags which control the format of the text. The summarizer is written in Perl (Wall et al., 1996), which is an excellent string-processing language.
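
A minimal sketch of this tag filtering follows; which tags count as page format versus text format is an assumption made here for illustration, not a description of SweSum's actual tag tables:

    # Keep text-format tags (used later for scoring), drop everything else.
    # The list of text-format tags is an illustrative assumption.
    my %text_format = map { $_ => 1 } qw(b i em strong h1 h2 h3 title);

    sub filter_tags {
        my ($html) = @_;
        $html =~ s{</?(\w+)[^>]*>}{ $text_format{lc $1} ? $& : '' }ge;
        return $html;
    }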

The idea is that high-scoring sentences in the original text are kept in the summary; the scores are calculated according to the criteria below.

Since the processed text is newspaper text, we use the so-called position score for that genre: sentences at the beginning of the text are given higher scores than the ones at the end. The formula is 1/n, where n is the line number; this is the baseline method described above.

Sentences in bold text, as indicated by HTML tags, are given a higher score than the ones without bold tagging, ditto for title tagging. Bold text also indicates the beginning of a new paragraph in some of the Swedish newspaper texts.

Sentences containing numerical data are given a higher score than the ones without numerical values.

Sentences which contain keywords are scored high, so-called term frequency (tf). To find the keywords one needs a dictionary of all open word classes, that is, of the meaning-carrying words. Since Swedish is an inflecting language it is very important to demorph each word, i.e. to find the stem of each word. Both the stemming and the open word class lookup are carried out with a Swedish roottable (morphological lexicon) containing 700,000 words or entries (Carlberger & Kann, 1999).
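
A sketch of this keyword scoring is given below, with a tiny hypothetical %roottable standing in for the 700,000-entry lexicon of Carlberger & Kann (1999); the entries and the choice to score a sentence as the sum of its stem frequencies are assumptions made for the example:

    # %roottable: inflected form => [stem, 1 if the word is open class].
    my %roottable = (
        bilen => ['bil', 1],    # "the car" => stem "car" (noun, open class)
        bilar => ['bil', 1],    # "cars"    => stem "car"
        och   => ['och', 0],    # "and" (closed class, not a keyword)
    );

    # Score each sentence (an arrayref of words) by the total text-wide
    # frequency of the open-class stems it contains.
    sub tf_scores {
        my @sentences = @_;
        my %freq;
        for my $s (@sentences) {
            for my $w (map { lc } @$s) {
                my $e = $roottable{$w} or next;
                $freq{ $e->[0] }++ if $e->[1];   # count open-class stems only
            }
        }
        return map {
            my $score = 0;
            for my $w (map { lc } @$_) {
                my $e = $roottable{$w};
                $score += $freq{ $e->[0] } if $e && $e->[1];
            }
            $score;
        } @sentences;
    }

In SweSum the lookup is of course made against the full lexicon rather than a toy hash.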

The user of SweSum can also enter his or her own keywords to the system; the user will then obtain a more user-centered summarization. The length of the summary, the compression rate, is of course selected by the user. Finally we use a naïve combination function: all the above parameters are normalized and put into a combination function with no special weighting, following (Lin, 1999). A sketch of the resulting total score is given below.
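
Putting this together, the total score of a sentence can be sketched as follows; the equal weights and the exact feature set are a simplified reading of the description above:

    # Total score of sentence number $n (counting from 1), combining the
    # features described above with no special weighting.  $tf is assumed
    # to be already normalized to [0,1].
    sub total_score {
        my ($n, $tf, $is_bold, $has_digits, $has_user_keyword) = @_;
        return 1 / $n              # position score (baseline)
             + $tf                 # term frequency score
             + $is_bold            # 1 if the sentence is in bold/title tags
             + $has_digits         # 1 if the sentence contains numerical data
             + $has_user_keyword;  # 1 if it contains a user-supplied keyword
    }

The summary then keeps the highest-scoring sentences, in their original order, until the compression rate selected by the user is reached.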

4. Evaluation

SweSum was used in a field test within the framework of 2D1418 Språkteknologi (Human Language Technology), a 4-credit course at NADA/KTH, Stockholm. The students were given the task of automatically summarising 10 texts, news articles and movie reviews. The purpose was to see how much a text can be summarised without losing coherence or important information.

The nine students carried out the test by first reading the text to be summarised and then gradually lowering the size of the summary, giving SweSum the percentage of the original text they would like in the summary, and noting in a questionnaire when coherence was broken and when important information was missing. This procedure was repeated for each of the 10 texts.

As can be seen in the Appendix, not all of the students completed the whole questionnaire, leaving the field test inconclusive. Despite this one can conclude that most of the time the students came to fairly similar conclusions. There are naturally exceptions, which only serve to illustrate the subjective nature of the test.

There are no corpora with manual extracts available for Swedish, as there are for English. It is therefore difficult to make a proper evaluation of automatic summaries in Swedish. We are planning to create such corpora using the technique proposed by Marcu (1999). Since we had very few participants in our field test we decided to use the median as a statistical measure of our results; a small sketch of the computation follows Table 1. We first calculated the total amount of summarised text (given in percent).

                 Information   Coherence
Total median:        30%          24%
Total average:       31%          26%

Table 1. Results from the field test.
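
The median and average in Table 1 are computed in the standard way; a minimal sketch:

    # Median and mean of a list of percentages (the per-text break points).
    sub median {
        my @s = sort { $a <=> $b } @_;
        my $mid = int(@s / 2);
        return @s % 2 ? $s[$mid] : ($s[$mid - 1] + $s[$mid]) / 2;
    }

    sub mean {
        my $sum = 0;
        $sum += $_ for @_;
        return $sum / @_;
    }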

From the field test we can conclude that the Swedish text summariser SweSum is as good as the English state-of-the-art text summarisers. According to Lin (1999), around 30% summarisation gives an ideal summarised text for English; Lin (1999) also estimates a 70-80% accuracy for a 30% summary of a 2-3 page news article, measured by F-score or tf-idf. Compare this to the summarization algorithm of MS Word 97, which gives its best summarization at around a 35% summary of a text (Lin, 1999).

SweSum is available for testing at (Dalianis, 2000). There is also an English version of the summarizer, where the Swedish roottable is replaced by an English one; the text summarization engine is identical.

5. Future extensions

Future extensions of SweSum could include the following:

We are currently working on pronoun resolution to make the summarized text more coherent. Incoherence occurs specifically when the summaries are below 30% of the original text, e.g. when a pronoun reference hangs free with no referent left in the text. Pronoun resolution will resolve the pronouns in the text and replace them with the original noun when necessary. Some early results are described in (Hassel & Dalianis, 2000).

One possibility when doing topic identification or keyword extraction is to use synonym terms, by means of some sort of Swedish ontology similar to WordNet, namely the Swedish WordNet, Swordnet, currently being developed at the Department of Linguistics at Lund University; compare selling and buying becoming business above. A toy sketch of such synonym fusion is given below.
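
As a toy sketch only (Swordnet is not yet available to us), synonym fusion before the keyword count could look as follows, with a hypothetical %synonym table standing in for the ontology lookup:

    # Map synonymous stems onto one canonical concept before counting
    # keywords, so that e.g. "selling" and "buying" both count as "business".
    my %synonym = (
        selling => 'business',
        buying  => 'business',
    );

    sub canonical {
        my ($stem) = @_;
        return exists $synonym{$stem} ? $synonym{$stem} : $stem;
    }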

Before performing the summarization, one could analyze the text to be summarized using the RIX readability index (sketched below), or let the user select different profiles for the summarization depending on the type of text: newspaper text, academic articles, business reports, technical reports, social science reports etc. Studies on how to recognize text genres using an advanced RIX have been carried out by Karlgren & Cutting (1994) and Karlgren (2000).
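
The RIX index itself is straightforward to compute: the number of long words divided by the number of sentences. A minimal sketch, where the seven-letter threshold for a long word and the sentence-splitting regex are the usual textbook choices, stated here as assumptions:

    # RIX readability index: long words (>= 7 letters) per sentence.
    sub rix {
        my ($text) = @_;
        my @sentences = grep { /\S/ } split /[.!?]+/, $text;
        my $long = grep { length($_) >= 7 } $text =~ /(\w+)/g;
        return @sentences ? $long / @sentences : 0;
    }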

Regarding text summarization in other languages, it is easy to replace the Swedish roottable with the roottable of another language while keeping the same text summarization engine, and hence accomplish automatic text summarization in other languages.

Acknowledgements

I would like to thank Johan Carlberger at NADA/KTH for acquiring the Swedish morphological lexicon for SweSum. I would also like to thank Martin Hassel at DSV, Stockholm University/KTH, for interpreting the results of the evaluation of SweSum, and finally I would like to thank the students of the course 2D1418 for their willingness to participate in our field studies.

6. References

J. Carlberger and V. Kann. 1999. Implementing an efficient part-of-speech tagger. Software Practice and Experience, 29, 815-832.

H. Dalianis. 2000. SweSum - A Swedish Textsummarizer.
http://www.nada.kth.se/~hercules/Textsumsummary.html (this report) and the summarizer
http://www.nada.kth.se/~xmartin/swesum/index.html and http://www.nada.kth.se/~hercules/textsum/textsum8.html

H. Dalianis. 1999. ASTROGEN - Aggregated deep and Surface naTuRal language GENerator
http://www.dsv.su.se/~hercules/ASTROGEN/ASTROGEN.html

H. Dalianis and E. Hovy. 1996. Aggregation in Natural Language Generation. In Adorni, G. & Zock, M. (Eds.), Trends in Natural Language Generation: an Artificial Intelligence Perspective, EWNLG'93, Fourth European Workshop, Lecture Notes in Artificial Intelligence, No. 1036, pp. 88-105, Springer Verlag.

H. Dalianis and E. Hovy. 1997. On Lexical Aggregation and Ordering. In Proceedings of the 6th European Workshop on Natural Language Generation, pp. 17-27, March 24-26, 1997, Gerhard-Mercator University, Duisburg, Germany.

H.P. Edmundson. 1969. New Methods in Automatic Extracting. Journal of the ACM 16(2), pp. 264-285.

M. Hassel and H. Dalianis. 2000. Pronominal Resolution in Text Summarisation. Submitted to the ACL'2000 Workshop on Recent Advances in Natural Language Processing and Information Retrieval, October 7-8, 2000, Hong Kong.

E. Hovy and C-Y Lin. 1997. Automated Text Summarization in SUMMARIST. In Proceedings of the Workshop on Intelligent Scalable Text Summarization, July.

J. Karlgren. 2000. Stylistic Experiments for Information Retrieval. Ph.D. Thesis (Filosofie Doktorsavhandling), Department of Linguistics, Stockholm University.

J. Karlgren and D. Cutting. 1994. Recognizing Text Genres with Simple Metrics Using Discriminant Analysis, Proceedings of COLING 94, Kyoto, Japan.

C-Y Lin and E. Hovy. 1997. Identifying Topics by Position. In Proceedings of the 5th Conference on Applied Natural Language Processing, March.

C-Y Lin. 1999. Training a Selection Function for Extraction. Submitted to SIGIR 99.

H.P. Luhn. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development 2(2), pp. 159-165.

D. Marcu. 1999. The construction of large-scale corpora for summarization research. In Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR-99, pp. 137-144.

D. Marcu. 1997. From Discourse Structures to Text Summaries. In Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, pp. 82-88, Madrid, Spain, July.

K. McKeown and D. Radev. 1995. Generating summaries of multiple news articles. In Proceedings, 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 74-82, Seattle, Washington, July.

G. Salton. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company.

L. Wall, T. Christiansen, and R.L. Schwartz. 1996. Programming Perl. O'Reilly & Associates Inc.

Appendix: Results from the field test

In the tables below, P1-P9 on the horizontal axis stand for persons 1-9 in the test group. The numbers on the vertical axis each represent a text in the field test. The percentages state how much of the original text was extracted into the summary. A dash (-) marks a missing answer.

Important information is missing at:


Text   P1    P2    P3    P4    P5    P6    P7    P8    P9
1      37%   25%   40%   15%   35%   20%   30%   35%   49%
2      40%   25%   55%   45%   45%   20%   40%   25%   53%
3      30%   30%   45%   25%   50%   25%   20%   30%   55%
4      20%   15%   20%   15%   35%   35%   50%   25%   29%
5      60%   20%   10%   15%   20%   15%   20%   45%   15%
6      37%   30%   99%   35%   40%   40%   30%   25%   -
7      44%   25%   30%   30%   25%   50%   50%   41%   -
8      -     30%   30%   30%   25%   35%   40%   -     -
9      20%   15%   20%   25%   15%   30%   19%   -     -
10     -     15%   10%   10%   35%   10%   20%   -     -

Coherence is broken at:


Text   P1    P2    P3    P4    P5    P6    P7    P8    P9
1      25%   20%   35%   30%   35%   25%   20%   20%   -
2      20%   -     30%   55%   30%   60%   20%   15%   53%
3      27%   25%   25%   25%   70%   25%   15%   25%   55%
4      15%   -     15%   15%   -     10%   5%    -     -
5      30%   20%   10%   20%   20%   15%   25%   -     -
6      30%   10%   30%   35%   35%   10%   15%   15%   -
7      20%   10%   25%   30%   40%   30%   35%   41%   -
8      -     20%   60%   25%   35%   35%   20%   -     -
9      5%    20%   40%   10%   10%   44%   -     -     -
10     -     5%    -     10%   35%   10%   10%   -     -