
Rule-Based Filtering Algorithm for Textual Document

Abstract (2. Language): 
Textual documents are usually unstructured and high-dimensional. Exploring the hidden information in unstructured text is useful for finding interesting patterns and valuable knowledge. However, not all terms in a text are relevant, and irrelevant terms can lead to misclassification. Improper filtering may also remove terms that have similar meanings. To reduce the high dimensionality of text, this study proposes a filtering algorithm that selects the important terms from pre-processed text and applies a term weighting scheme to address the synonym problem, which helps in selecting relevant terms. The algorithm uses a keyword library of special terms, developed to ensure that important terms are not eliminated during filtering. Its performance is compared with rough set attribute reduction (RSAR) and information retrieval (IR) approaches. In the experiment, the proposed filtering algorithm outperformed both RSAR and IR in terms of the relevant terms extracted.
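The abstract gives no rules or formulas, so the following is only a rough sketch of the general idea, not the authors' algorithm. It assumes TF-IDF as the term weighting scheme (the abstract does not name one), a hypothetical `synonyms` map that folds terms with similar meaning into one canonical term before weighting, and a hypothetical `keyword_library` set whose members are always kept. The function name, parameters, threshold, and sample data are all illustrative.

```python
import math
from collections import Counter


def filter_terms(docs, keyword_library, synonyms, threshold):
    """Return, for each document, the set of terms that survive filtering.

    docs            -- list of pre-processed documents, each a list of terms
    keyword_library -- set of special terms that must never be removed (assumed)
    synonyms        -- dict mapping a term to its canonical synonym (assumed)
    threshold       -- minimum TF-IDF weight for an ordinary term (assumed)
    """
    # Fold synonyms into one canonical term so they share a single weight.
    canonical = [[synonyms.get(term, term) for term in doc] for doc in docs]

    # Document frequency: in how many documents each term appears.
    n_docs = len(canonical)
    doc_freq = Counter(term for doc in canonical for term in set(doc))

    kept = []
    for doc in canonical:
        term_counts = Counter(doc)
        selected = set()
        for term, count in term_counts.items():
            # Plain TF-IDF; a stand-in for the unspecified weighting scheme.
            weight = (count / len(doc)) * math.log(n_docs / doc_freq[term])
            # Keep the term if it is protected or its weight is high enough.
            if term in keyword_library or weight >= threshold:
                selected.add(term)
        kept.append(selected)
    return kept


# Illustrative data only.
docs = [
    ["engine", "fault", "car", "report"],
    ["automobile", "engine", "report"],
    ["weather", "report", "rain"],
]
result = filter_terms(
    docs,
    keyword_library={"report"},
    synonyms={"automobile": "car"},
    threshold=0.2,
)
```

In this toy run, "report" appears in every document and so has a TF-IDF weight of zero, yet it is retained because it is in the keyword library; without that protection a purely weight-based filter would discard it.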
44
48

