Pattern Mining for Information Retrieval from Large Databases

Abstract

With the widespread use of databases and explosive growth in their sizes are reason for the attraction of the data mining for retrieving the useful informations. Desktop has been used by tens of millions of people and we have been humbled by its usage and great user feedback. However over the past seven years we have also witnessed some changes in how users store and access their own data, with many moving to web based application. Despite the increasing amount of information available in the internet, storing files in personal computer is a common habit among internet users. The motivation is to develop a local search engine for users to have instant access to their personal information.The quality of extracted features is the key issue to text mining due to the large number of terms, phrases, and noise. Most existing text mining methods are based on term-based approaches which extract terms from a training set for describing relevant information. However, the quality of the extracted terms in text documents may be not high because of lot of noise in text. For many years, some researchers make use of various phrases that have more semantics than single words to improve the relevance, but many experiments do not support the effective use of phrases since they have low frequency of occurrence, and include many redundant and noise phrases. In this paper, we propose a novel pattern discovery approach for text mining.To evaluate the proposed approach, we adopt the feature extraction method for Information Retrieval (IR).

Keywords –Pattern mining, Text mining, Information retrieval, Closed pattern.

Introduction

In the past decade, for retrieving an information from the large database a significant number of datamining techniques have been presented that includes association rule mining, sequential pattern mining, and closed pattern mining. These methods are used to find out the patterns in a reasonable time frame, but it is difficult to use the discovered pattern in the field of text mining. Text mining is the process of discovering interesting information in the text documents. Information retrieval provide many methods to find the accurate knowledge form the text documents. The most commonly used method for finding the knowledge is the phrase based approaches, but the method have many problems such as phrases have low frequency of occurrence, and there are large number of noisy phrases among them.If the minimum support is decreased then it will create lot of noisy pattern

Get Help With Your Essay

If you need assistance with writing your essay, our professional essay writing service is here to help!

Essay Writing Service

Pattern Classification Method

To find the knowledge effectively without the problem of low frequency and misinterpretation a pattern based approach (Pattern classification method) is discovered in this paper. This approach first find out the common character of pattern and evaluates the weight of the terms based on distribution of terms in the discovered pattern. It solves the problem of misinterpretation. The low frequency problem can also be reduced by using the pattern in the negatively trained examples. To discover patterns many algorithms are used such as Apriori algorithm, FP-tree algorithm, but these algorithms does not tell how to use the discovered patterns effectively. The pattern classification method uses closed sequential pattern to deal with large amount of discovered patterns efficiently. It uses the concept of closed pattern in text mining.

Preprocessing

The first step towards handling and analyzing textual data formats in general is to consider the text based information available in free formatted text documents.Real world databases are highly susceptible to noisy, missing, and inconsistent data due to their huge size. These low quality data will lead to low quality mining results. Initially the preprocessing is done with text document while storing the content into desktop systems.Commonly the information would be processed manually by reading thoroughly and then human domain experts would decide whether the information was good or bad (positive or negative). This is expensive in relation to the time and effort required from the domain experts. This method includes two process.

Removing stop words and stem words

To begin the automated text classification process the input data needs to be represented in a suitable format for the application of different textual data mining techniques, the first step is to remove the un-necessary information available in the form of stop words.Stop words are words that are deemed irrelevant even though they may appear frequently in the document. These are verbs, conjunctions, disjunctions and pronouns, etc. (e.g. is, am, the, of, an, we, our). These words need to be removed as they are less useful in interpreting the meaning of text.

Stemming is defined as the process of conflating the words to their original stem, base or root. Several words are small syntactic variants of each other since they share a common word stem. In this paper simple stemming is applied where words e.g. ‘deliver’, ‘delivering’ and ‘delivered’ are stemmed to ‘deliver’. This method helps to capture whole information carrying term space and also reduces the dimensions of the data which ultimately affects the classification task. There are many algorithms used to implement the stemming method. They are Snowball, Lancaster and the Porter stemmer. Comparing with others Porter stemmer algorithm is an efficient algorithm. It is a simple rule based algorithm that replaces a word by an another. Rules are in the form of (condition)s1->s2 where s1, s2 are words. The replacement can be done in many ways such as, replacing sses by ss, ies by i, replacing past tense and progressive, cleaning up, replacing y by i, etc.

Weight Calculation

The weight of the each term is calculated by multiplying the term frequency and inverse document frequency. Term frequency find the occurrence of the individual terms and counts. Inverse document frequency is a measure of whether a term is common or rare across all documents.

Term Frequency:

Tf(t,d)=0.5+0.5*f(t,d)/max{f(w,d):wbelongs to d}

Where d represents single document and t represents the terms

Inverse Document Frequency:

IDF(t,D)= log(Total no of doc./No of doc. Containing the term)

Where D represents the total number of documents

Weight:

Wt=Tf*IDF

Clustering

Cluster is a collection of data objects. Similar to one another within the same cluster. Cluster analysis will find similarities between data according to the characteristics found in the data and grouping similar data objects into clusters.Clustering is defined as a process of grouping data or information into groups of similar types using some physical or quantitative measures. It is an unsupervised learning. Cluster analysis used in many applications such as, pattern recognition, data analysis and web for information discovery. Cluster analysis support many types of data like, Data matrix, Interval scaled variables, Nominal variables, Binary variables and variables of mixed types. There are many methods used for clustering. The methods are partitioning methods, hierarchical methods, density based methods, grid based methods and model based methods. In this paper partitioning method is proposed for clustering.

Partitioning methods

This method classifies the data into k-groups, which together satisfy the following requirements: (1) each group must contain at least one object, (2) each object must belong to exactly one group. Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k<=n. In a good partitioning method objects in the same cluster are related to each other, whereas objects of different clusters are very different. Partitioning algorithm is nonhierarchical. Since only one set of clusters is output, the user normally has to input the desired number of clusters k. It produce clusters by optimizing a criterion function define either locally or globally. The most commonly used partitional clustering strategy is based on the square error criterion. The general objective is to obtain the partition that, for a fixed number of clusters, minimizes the total square error This paper represents the partitioning method by using the k-means algorithm.

K-means algorithm

K-means is one of the simplest unsupervised learning algorithms. It takes the input parameter, k, and partitions a set of n objects into k-clusters so that the resulting intra cluster similarity is high but the inter cluster similarity is low. It is centroid based technique. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the clusters centroid.

Input:k: the number of clusters,

D: a data set containing n objects.

Output:

A set of k clusters.

Methods:

Select an initial partition with k clusters containing randomly chosen samples, and compute the centroids of the clusters.
Generate a new partition by assigning each sample to the closest cluster center.
Compute new cluster centers as the centroids of the cluster.
Repeat steps 2 and 3 until an optimum value of the criterion function is found or until the cluster membership stabilizes.

This algorithm faster than hierarchical clustering. But it is not suitable to discover clusters with non-convex shapes.

Fig.1. K-Means Clustering

Classification

It predicts categorical class labels and classifies the data based on the training set and the values in classifying the attribute and uses it in classifying the new data. Data classification is a two step process (1) learning, (2) classification. Learning can be classified into two types supervised and unsupervised learning. The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data. There are many classification methods are available such as, K-nearest neighbor, Genetic algorithm, Rough Set Approach, and Fuzzy Set approaches.The classification technique measures the nearing occurrence. It assumes the training set includes not only the data in the set but also the desired classification for each item. The classification is done through training samples, where the entire training set includes not only the data in the set, but also the desired classification for each item. The Proposed approaches find the minimum distance from the new or incoming instance to the training samples. On the basis of finding the minimum distance only the closest entries in the training set are considered and thenew item is placed into the classwhich contains the most items of the K. Here classify thesimilarity text documents and file indexing is performed to retrieve the file in effective manner.

Result and Discussion

The input file is given and initial preprocessing is done with that file. To find the match with any other training sample inverse document frequency is calculated. To find the similarities between documents clustering is performed.Then classification is performed to find the input matches with any of the clusters. If it matches the particular cluster file will be listed.Theclassification techniques classify the various file formats and the report is generated as percentage of files available. The graphical representation shows the clear representation of files available in various formats. This method uses least amount of patterns for concept learning compare to other methods such as, Rocchio, Prob, nGram , the concept based models and the most BM25 and SVM models. The proposed model is achieved the high performance and it determined the relevant information what users want. This method reduces the side effects of noisy patterns because the term weight is not only based on term space but it also based on patterns. The proper usage of discovered patterns is used to overcome the misinterpretation problem and provide a feasible solution to effectively exploit the vast amount of patterns generated by data mining algorithms.

Conclusion

Storing huge amount of files in personal computers is a common habit among internet users, which is essentially justified for the following reasons,

1) The information will not always permanent

2) The retrieval of information differs based on the different query search

3) Location same sites for retrieving information is difficult to remember

4) Obtaining information is not always immediate. But these habits have many drawbacks. It is difficult to find when the data is required.In the Internet, the use of searching techniques is now widespread, but in terms of personal computers, the tools are quite limited. The normal “Search or “Find” options take several hours to produce the search result. It acquires more time to predict the desire result where the time consumption is high.The proposed system provides accurate result comparing to normal search.All files are indexed and clustered using the efficient k means techniques so the information retrieved in efficient manner.

The best and advanced clustering gadget provides optimized time results.Downtime and power consumption is reduced.

References

[1]K. Aas and L. Eikvil, ‘’Text Categorization: A Survey,’’ Technical Report NR 941, Norwegian Computing Centre, 1999.

[2] R. Agarwal and R.Srikanth, ‘’Fast Algorithm for Mining Association Rules in Large Databases, ‘’ Proc. 20^th Int’l Conf. Very Large Data Bases(VLDB’94), pp.478-499, 1994.

[3] H. Ahonen, O. Heinonen, M. Klemettinen, and A.I. Verkamo, “Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections,” Proc. IEEE Int’l Forum on Research and Technology Advances in Digital Libraries (ADL ’98), pp. 2-11, 1998.

[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.

[5] N. Cancedda, N. Cesa-Bianchi, A. Conconi, and C. Gentile, “Kernel Methods for Document Filtering,” TREC, trec.nist.gov/ pubs/trec11/papers/kermit.ps.gz, 2002.

[6] N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders, “Word- Sequence Kernels,” J. Machine Learning Research, vol. 3, pp. 1059- 1082, 2003.

[7] M.F. Caropreso, S. Matwin, and F. Sebastiani, “Statistical Phrases in Automated Text Categorization,” Technical Report IEI-B4-07- 2000, Instituto di Elaborazionedell’Informazione, 2000.

[8] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.

[9] S.T. Dumais, “Improving the Retrieval of Information from External Sources,” Behavior Research Methods, Instruments, and Computers, vol. 23, no. 2, pp. 229-236, 1991.

[10] J. Han and K.C.-C. Chang, “Data Mining for Web Intelligence,” Computer, vol. 35, no. 11, pp. 64-70, Nov. 2002.

[11] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’00), pp. 1-12, 2000.

[12] Y. Huang and S. Lin, “Mining Sequential Patterns Using Graph Search Techniques,” Proc. 27th Ann. Int’l Computer Software and Applications Conf., pp. 4-9, 2003.

[13] N. Jindal and B. Liu, “Identifying Comparative Sentences in Text Documents,” Proc. 29th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’06), pp. 244-251, 2006. [14] T. Joachims, “A Probabilistic Analysis of the Rocchio Algorithm with tfidf for Text Categorization,” Proc. 14th Int’l Conf. Machine Learning (ICML ’97), pp. 143-151, 1997.

[15] T. Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” Proc. European Conf. Machine Learning (ICML ’98),, pp. 137-142, 1998.

[16] T. Joachims, “Transductive Inference for Text Classification Using Support Vector Machines,” Proc. 16th Int’l Conf. Machine Learning (ICML ’99), pp. 200-209, 1999.

[17] W. Lam, M.E. Ruiz, and P. Srinivasan, “Automatic Text Categorization and Its Application to Text Retrieval,” IEEE Trans. Knowledge and Data Eng., vol. 11, no. 6, pp. 865-879, Nov./Dec. 1999.

[18] D.D. Lewis, “An Evaluation of Phrasal and Clustered Representations on a Text Categorization Task,” Proc. 15th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’92), pp. 37-50, 1992.

[19] D.D. Lewis, “Feature Selection and Feature Extraction for Text Categorization,” Proc. Workshop Speech and Natural Language, pp. 212-217, 1992.

[20] D.D. Lewis, “Evaluating and Optimizing Automous Text Classification Systems,” Proc. 18th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’95), pp. 246-254, 1995.

[21] G. Salton and C. Buckley, “Term-Weighting Approaches in Automatic Text Retrieval,” Information Processing and Management: An Int’l J., vol. 24, no. 5, pp. 513-523, 1988.

[22] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.

[23] Y. Yang, “An Evaluation of Statistical Approaches to Text Categorization,” Information Retrieval, vol. 1, pp. 69-90, 1999.

[24] Y. Yang and X. Liu, “A Re-Examination of Text Categorization Methods,” Proc. 22nd Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’99), pp. 42-49, 1999.

Share this: Facebook Twitter Reddit LinkedIn WhatsApp

Cite This Work

To export a reference to this article please select a referencing style below: