Tabsum- A New Persian Text Summarizer


Authors

Saeid Masoumi - M.Sc in Software Engineering at University of Tabriz, Tabriz, Iran. Mohammad-Reza Feizi-derakhshi - Assistant professor at University of Tabriz, Tabriz, Iran. Raziyeh Tabatabaei - M.Sc in Software Engineering at University of Tabriz, Tabriz, Iran.


Abstract

With the rapid increase in the amount of online text information, it became more important to have tools that would help users distinguish the important content. Automatic text summarization attempts to address this problem by taking an input text and extracting the most important content of it. However, the determination of the salience of information in the text depends on different factors and remains as a key problem of automatic text summarization. In the literature, there are some studies that use lexical chains as an indicator of lexical cohesion in the text and as an intermediate representation for text summarization. Also, some studies make use of genetic algorithms in order to examine some manually generated summaries and learn the patterns in the text which lead to the summaries by identifying relevant features which are most correlated with human generated summaries. In this study, we combine these two approaches of summarization. Firstly, some of preprocessing operations like normalizer, tokenizer, stop word remover, stemmer, and POS tagger are done on the text. After that for each sentence we have only semantic words that are independent. Then, by set of position, thematic, and coherence features we score sentences. The final score of each sentence will be the integration of those features. Each feature has its own weight and should be identified to have well summary. For this reason first system goes throw learning phase to determine ache feature weight by genetic algorithm. The next phase is testing phase. In this phase system receives new documents and uses Persian WordNet and lexical chains to extract deep level of knowledge about the text. This knowledge is combined with other higher level analysis results. Finally, sentences are scored, sorted, and selected and summary is made. We evaluated our proposed system by two methods. 1) Precision/recall, 2) TabEval (a new evaluation tool for Persian text summarizers). We compared our system with two other Persian summarizers (FarsiSum, Ijaz). Results showed that our system had higher performance rather than others (i.e. higher precision/recall average and the best average score of TabEval).


Keywords


MSC


References