
Managing word form variation of text retrieval in practice – Why language technology is not the only cure for better IR performance?

Abstract (2. Language): 
Purpose: The article discusses, on a general methodological level, the methods that have been used to manage word form variation of single keywords over the history of textual information retrieval (IR). It offers the reader a practical overall guide for choosing among these methods for different types of European languages. The methods compared are stemming, lemmatization, truncation, syllabification, unsupervised morphological methods, character n-gramming and generation of inflected word forms.

Methodology/Approach: Based on empirical findings and results achieved by other researchers, the paper discusses pros and cons of the different keyword variation management methods in a broader context than is usual in IR, where normally only the achieved effectiveness results are considered. The study proposes a list of five criteria for comparing conflation methods in general and offers a heuristic for choosing a suitable conflation method for a specific language.

Findings: Simpler character-based methods could be preferred in IR over very sophisticated linguistic methods. It is also suggested that for morphologically simple languages, such as English, any kind of keyword variation management may be futile, as the resulting increase in IR effectiveness may be very low. Morphologically more complex languages can be conflated quite effectively with the simple methods for present-day IR search engines.

REFERENCES


Alkula, R. (2001). From plain character strings to meaningful words: producing better full text databases for inflectional and compounding languages with morphological analysis software. Information Retrieval, 4 (3-4), 195−208.
Bane, M. (2008). Quantifying and measuring morphological complexity. In Proceedings of the 26th West Coast Conference on Formal Linguistics (pp. 67–76). Retrieved from http://www.lingref.com/cpp/wccfl/26/paper1657.pdf
Church, K. W. (2005). The DDI approach to morphology. In A. Arppe, L. Carlson, K. Lindén, J. Piitulainen, M. Suominen, M. Vainio, H. Westerlund and A. Yli-Jyrä (eds.), Inquiries into Words, Constraints and Contexts. Festschrift for Kimmo Koskenniemi on his 60th Birthday (pp. 25-34). Retrieved from http://cslipublications.stanford.edu/koskenniemi-festschrift/kk-festschr...
Croft, W. B., Metzler, D. & Strohman, T. (2010). Search Engines. Information Retrieval in Practice. Boston, Paris: Pearson.
Croft, W. B., Metzler, D. & Strohman, T. (2010 a). Search Engines. Information Retrieval in Practice (pp. 13-28). Boston, Paris: Pearson.
Croft, W. B., Metzler, D. & Strohman, T. (2010 b). Search Engines. Information Retrieval in Practice (p. 327). Boston, Paris: Pearson.
Ehret, K. & Szmrecsanyi, B. (2011). An information-theoretic approach to assess linguistic complexity. Retrieved from http://www.benszm.net/omnibuslit/EhretSzmrecsanyi_web.pdf
Galvez, C., Moya-Anegón, F. de & Solana, V. H. (2005). Term conflation methods in information retrieval. Non-linguistic and linguistic approaches. Journal of Documentation, 61 (4), 520–547.
Grünwald, P. (2007). The Minimum Description Length Principle (p. 29). Cambridge, Mass: MIT Press.
Hammarström, H. & Borin, L. (2011). Unsupervised learning of morphology. Computational Linguistics, 37 (2), 309–350.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42 (1), 7-15.
Iggesen, O. A. (2011). Number of cases. In M. S. Dryer and M. Haspelmath (eds.), The World Atlas of Language Structures Online. Munich: Max Planck Digital Library, chapter 49A. Retrieved from http://wals.info/chapter/49A
Ingwersen, P. & Järvelin, K. (2005). The Turn. Integration of Information Seeking and Retrieval in Context. Dordrecht: Springer.
Ingwersen, P. & Järvelin, K. (2005 a). The Turn. Integration of Information Seeking and Retrieval in Context (p. 119). Dordrecht: Springer.
Managing word form variation of text retrieval in practice Kettunen
TRIM 9(1) 20
Ingwersen, P. & Järvelin, K. (2005 b). The Turn. Integration of Information Seeking and Retrieval in Context (p. 115). Dordrecht: Springer.
Juola, P. (1998). Measuring linguistic complexity: the morphological tier. Journal of Quantitative Linguistics, 5 (3), 206–13.
Juola, P. (2008). Assessing linguistic complexity. In M. Miestamo, K. Sinnemäki and F. Karlsson (eds.), Language Complexity: Typology, Contact, Change. Amsterdam: John Benjamins Press.
Kettunen, K. & Airio, E. (2006). Is a morphologically complex language really that complex in full-text retrieval? In T. Salakoski et al. (eds.), Advances in Natural Language Processing, LNAI 4139 (pp. 411–422). Berlin Heidelberg: Springer-Verlag.
Kettunen, K. & Arvola, P. (2012). Generating variant keyword forms for a morphologically complex language leads to successful information retrieval with Finnish. In B. Larsen and M. Salampasis (eds.), Advances in Multidisciplinary Retrieval, 5th Information Retrieval Facility Conference (pp. 113-126).
Kettunen, K. (2009). Reductive and generative approaches to management of morphological variation of keywords in monolingual information retrieval. Journal of Documentation, 65 (2), 267–290.
Kettunen, K., McNamee, P. & Baskaya, F. (2010). Using syllables as indexing terms in full-text retrieval. In I. Skadina, A. Vasiljevs (eds), Human Language Technologies, the Baltic Perspective (pp. 225−232). IOS Press.
Koskenniemi, K. (1996). Finite state morphology and information retrieval. Natural Language Engineering, 2 (4), 331–336.
Kurimo, M., Virpioja, S. & Turunen, V. (eds.) (2010). Proceedings of the Morpho Challenge 2010 workshop. Technical Report TKK-ICS-R37, Aalto University School of Science and Technology, Department of Information and Computer Science, Espoo, Finland. Retrieved from http://research.ics.aalto.fi/events/morphochallenge2010/papers/ProcMorph....
Lazarinis, F., Vilares, J., Tait, J. & Efthimiadis, E. (2009). Current research issues and trends in non-English Web searching. Information Retrieval, 12 (3), 230-250.
Leturia, I., Gurrutxaga, A., Areta, N., Alegria, I. & Ezeiza, A. (2012). Morphological query expansion and language-filtering words for improving Basque web retrieval. Language Resources & Evaluation. DOI: 10.1007/s10579-012-9208-x
Loponen, A. & Järvelin, K. (2010). A dictionary- and corpus-independent statistical lemmatizer for information retrieval in low resource languages. In M. Agosti, N. Ferro, C. Peters, M. de Rijke, and A. Smeaton (eds.), CLEF'10 Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum. LNCS vol. 6360 (pp. 3-14). Heidelberg: Springer.
Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 23–31.
McNamee, P., Nicholas, C. & Mayfield, J. (2009). Addressing morphological variation in alphabetic languages. In Proceedings of the 32nd Annual International Conference on Research and Development in Information Retrieval (SIGIR-2009), Boston, MA, 75-82.
Pirkola, A. & Järvelin, K. (2001). Employing the resolution power of search keys. Journal of the American Society for Information Science and Technology, 52 (7), 575−583.
Pirkola, A. (2001). Morphological typology of languages for IR. Journal of Documentation 57 (3), 330-348.
Sadeniemi, M., Kettunen, K., Lindh-Knuutila, T. & Honkela, T. (2008). Complexity of European Union languages: a comparative approach. Journal of Quantitative Linguistics, 15 (2), 185–211.
Sparck-Jones, K. (1974). Automatic indexing. Journal of Documentation, 30 (4), 393-432.
Stump, G. T. (2001). Inflection. In A. Spencer and A. Zwicky (eds.), The Handbook of Morphology (pp. 13-43). Hoboken, NJ: John Wiley and Sons.
Uyar, A. (2009). Google stemming mechanisms. Journal of Information Science, 35 (5), 499-514.
