Regression Analysis of the Number of Association Rules

期刊名字：国际自动化与计算杂志（英文版）
文件大小：813kb
论文作者：Wei-Guo Yi，Ming-Yu Lu，Zhi Liu
作者单位：Department of Information Science and Technology,Department of Software Institute
更新时间：2020-11-22
下载次数：次

论文简介

International Journal of Automation and Computing8(1), February 2011, 78-82DOI: 10.1007/511633-010-0557-xRegression Analysis of the Number of Association RulesWei-Guo Yi1+2Ming-Yu Lul'Department of Information Science and Technology, Dalian Maritime University, Dalian 116026, PRC2 Department of Software Institute, Dalian Jiaotong U niversity, Dalian 116052, PRCAbstract: The typical model, which involves the measures: support, confdence, and interest, is often adapted to mining associationrules. In the model, the related parameters are usually chosen by experience; consequently, the number of useful rules is hard toestimate. If the number is too large, we cannot effectively extract the meaningful rules. This paper analyzes the meanings of theparameters and designs a variety of equations between the number of rules and the parameters by using regression method. Finally,we experimentally obtain a preferable regression equation. This paper uses multiple correlation cofficients to test the ftting effectsof the equations and uses significance test to verify whether the cofficients of parameters are significantly zero or not. The regressionequation that has a larger multiple correlation coefficient will be chosen as the optimally fitted equation. With the selected optimalequation, we can predict the number of rules under the given parameters and further optimize the choice of the three parameters anddetermine their ranges of values.Keywords: Association rules, regression analysis, multiple correlation cofficients, interest, support, confidence.1 Introductionforward a number of metrics of interest. In [6]， threetypes of metrics (any-confidence, all-confidence, and bond)Mining association rule is an important research fieldwere proposed to take place of support. In [7]， a sig-in data mining and has been widely used. The cur-nificance measure for association was proposed by mak-rent research mainly includes two aspects: generating efing use of x2 correlation test. Its interest was defined asficiency of frequent itemsets and the correlation analysisInterest= P(AB)/(P(A)P(B)). In [8], the interest of neg-of rules.improve the fficiency of mining frequentative association rules: P(A)P(B)/P(A∪B) was proposed.sets, a variety of methods have been proposed since 1993In [9], a framework based on correlation was proposed. This .when Agrawal et al.[1] proposed the classic apriori algo-framework uses residual analysis to determine whether tworithm. The Closet+ algorithm!24 uses the frequent patternitems were independent of each other. Residual analysistree (FP-tree) as a storage structure. However, its projec-can easily obtain the real (rather than concurrent) associ-tion mode of the establishment of conditions database isation rules including the negative rules. In [10], a methoddiferent from that of FP-tree and Closet. Closet+ usesbased on linear regression and reverse authentication wasa mixed projection strategy. In addition, this algorithmpresented to verify the correlation of the association rules.also uses many efficient strategies of pruning and testingThe method verifies the generated rules through a selectedsubset. As for the aspect of running time, memory andtime period. Therefore, how to select the size of the timescalability, the Closet+ algorithm is superior to the otherperiod becomes an issue worth exploring. Tsukimoto'frequent closed itemsets mining algorithms. An efficientdeveloped a data mining technique called logical regressionalgorithm for closed itemnset mining (CHARM)3 uses theanalysis (LRA)，which consists of regression analysis andtransaction marking set in the vertical format to indicatethe approximation method. The LRA used nonparametricthe pattern support sets and does a two-way logo search forregression analysis in functional magnetic resonance imag-the itemsets and the transaction marking set; hence, it hasing (FMRI) to discover the rules of brain functions. Thea high pruning eficiency. The disadvantage of CHARM isregression equation is linear, so the effect of equation isits high storage overhead and low efficiency of projectionnot good. Liuet al.I 2 extended a generalized least-squaresoperation. For intensive data sets, CHARM has poor effi-method and a multivariate maximum likelihood method tociency and scalability. A new algorithm of mining associa-estimate the summarized nonlinear dose-response relationtion rules with multiple minimum supports was proposed intaking random effects into account. These methods were[4] and applied to an intelligent mining system for supperreadily suited to fit and test models with covariates andcurvilinear dose- response relations. Tsaur and Wanging with length decreasing support constraints. This ap-veloped a fuzzy regression model with a regression intervalproach integrates weight constraints and length decreasingthat was close to the sum of the radius values of collectedsupport constraints into the pattern growth algorithm andinterval data. The work enlarged the feasible region so thatcan generate more concise and important weighted frequentthe fuzzy regression interval might be possibly included inpatterns.the collected data by fuzzifying the constraints of the fuzzyAs for the correlation of the rules, researchers have putregression model within a given tolerance value. Finally, asatisfactory fuzzy regression equation with maximum pos-Manuscript received March 13, 2010; revised June 21, 2010sibility was derivedtion of China (No. J07240003, No. 60773084, No. 60603023) and Na-The above re中国煤化工esponding anal-tional Research Fund for the Doctoral Program of Higher Educationysis from the Pd relevance, butof China (No. 20070151009).YHCNMH G.W. G. Yiet al. / Regression Analysis of the Number of Association Rules79the researches about the number of rules are rare. There-tions of a number of itemsets. Therefore, the setting offore, this paper tries to analyze the given parameters tominsup will be a repeated adjustment process.investigate how to estimate the number of rules and designConfidence denotes the probability that somitemsets'better regression equations. At present, the most commonemergence will lead to others' occurrence. However, weframework of association rules is the support-confidence-notice that the confidence of association rule F→E onlyinterest (SCI) framework. Therefore, this paper will ana-considers the possibility of the case where E and F occurlyze the parameters' significance of this framework, takingsimultaneously, without considering the possibility of thento account the fact that the number of generated rulescase where only E occurs or the case whether E and F areis not possible to estimate. In the paper, we will design acorrelated. Consequently, many obtained association rulesvariety of regression equations to establish the relationshipmay be inffectivebetween the number of rules and the parameters and use theInterest(IS(A,B) = P(AB)/(P(A)P()))[)] expressesreal data of coronary heart disease to validate. Through thethe correlation between the antecedent and the consequenceobtained equations, we could estimate the number of rulesof a rule. If the value of the interest is greater than 1 de-under the given parameters and optimize the selection ofgree, then the antecedent and the consequence of a rule areparameters. Therefore, the establishment of the regressionpositively correlated. If the value of the interest is closeequations will play a better supporting role for the person-to 1，then the antecedent and the consequence are inde-nel to use this framework.pendent. If the value of the interest is below 1, then theantecedent and the consequence are negatively correlated.2 Problem statement of associationHowever, the value of interest is not really meaningful. Itjust qualitatively describes whether the antecedent and therules and parameters analysisconsequence of a rule are correlated or not. Therefore, theinterest can partially make up for the lack of confidence pa-2.1 Problem statementrameter. The following instances can explain the meaningHere, we give the general definition of association rules.of the interest.that I = {i1,i2,... ,im} is a binary set, in whichExample 1. Here, we give some supports of itemsets.parameters i1,i2,... ,im are called items. Assume thatP(A)=0.8, P(B)=0.8, P(AB)= 0.8,transaction T is the itemset with the restriction T∈I.Define D as the transaction set. Let X be the set con-P(C)=0.4, P(D)=0.4, P(CD)= 0.4.taining some items in I, and the transaction T include Xif X∈T. The length of the itemset is defined as theThe calculated results of interest are given as follows:number of the items in it. Association rules are those im-plications that can be depicted as X→Y, where X C I,IS(A,B)=- P(AB)= 1.25(P(A)P(B)) (0.8x 0.8)YCI,andX∩Y=重.ThesupportoftheruleX→Yinthe transaction database D is the ratio between the trans-IS(C, D)=(P(C)P(D))P(CD)(0.4x 0.4)= 2.5.action number including X and Y in the transaction setsand all the transaction number, which can be written asWe can find that A and B are always present or absentsupport(X→Y). That is to say, support(X→Y) =at the same time and so are C and D. However, their val-|{T : X∪Y E T, T∈D}|/IDI. The confidence of theues of interest are different. From the values of interest,rule X→Y in the transaction sets is the ratio betweenwe can learn that the itemsets with lower. appearing oc-the transaction number including X and Y and those in-currence probabilities have higher interest and the itemsetscluding X, which can be written as confidence(X→Y);with higher occurrence probabilities have lower interest.that is to say, confidence(X →Y) = {T: X∪Y∈T,Example 2. Here, we give some supports of itemsets.T∈D}l/{T: X∈T, T∈D}| To a certain transac-tion set D, mining association rules mean to figure out theP(A)=0.8，P(B)=0.8, P(AB) = 0.64,association rules whose support and confidence are higherP(C)=0.5，P(D)=0.8, P(CD)= 0.3.than or equal to the minimum support (minsup) and theminimum confidence (minconf), respectively.The results of interest are as follows:P(AB)0.642.2 Parameters analysisIS(A,B)= (P(A)P(B)) = 0.8x0.8) =1Support denotes the probability of frequent itemsets' OC-0.currence. However, it is dificult to determine the appro-IS(C, D):(P(C)P(D)) (0.5x0.8=0.75< 1.priate threshold of support in practical applications. If thevalue of the minsup is low, it will generate a large numberThrough the above calculations, we can find that A andof association rules, including many rules that we are notB are irrelated with each other, and C and D are negativelyinterested in. For example, there will be a large number ofcorrelative.such rules: an itemset with a low probability of emergenceTherefore, we argue that interest only measures the cor-leads to an itemset with a high probability of emergence.relation of rules from the perspective of quality. There is noSuch rules might not be useful for users. If the value ofbetter quantitative analysis of the correlation between theminsup is high, it will lose many low support patterns, parantecedent o” 中国煤化工aule Theresticularly those with long pattern and low support. Thesefore, many reseassociation rulespatterns may be of great significance for finding combina-use the supportMHCN M H Gork..80International Journal of Automation and Computing 8(1), February 20113 Design of the regression equationget the following equations group.3.1 Design equationsNβo+ 2 xralβ1+ 2工江2B2+亡xiB3+ 2 xisBa= 2 yiThe current framework of association rule is the support-2xi1βo+ 2罐B + 2 xi1xi2Bz + 2 xixisβ3+仁1confidence- interest framework. By investigating the framework, we can find the following facts. A smaller minsup2 xi1Ti4β4=二xi1Vileads to more number of rules; a bigger minsup leads toless number of rules. The range of minsup is (a, 1), and a2 xi25o+二Ti1xi2B1+二x路2B2 +二:i2i3β3+is the minsup value of the smallest probability. The effectof minconf is the same as minsup, but it has less infuence; Ti2Ti4B4= E xi2Yion the number of the rules than minconf. From the defini-tion confidence(A→B)= P(AB)/P(A), we can see that if二xi3Bo+ E xi1Xi3Br + E xi2Xi3532 + E x3$3+P(AB)= a and P(A)= 1, confidence(A→B) will obtainthe minimum value a. Therefore, the range of minconf is; Ti3DisBs= 2 xi3Yi(a, 1). Since the change of the parameter is merely relatedto qualitative analysis, the value of interest usually takes 1.二xi4Bo+ E xiNiaBr +二xi2xi4}2 + E xi3xi483+As a result, when designing the regression equation, we will台1directly add the coefficient of interest to the constant term.;唱β4=二Ti4Vi.According to the significance of parameters and the scatterdiagrams, we design a group of regression equations withThe values of coefficients (Bo, β, β2, β3, β4) can be ob-two variables, including the popular equations in designingtained by solving the equations group; then, regressionnon-linear regression equations. These equations are shownequation can be obtained.in (1)- (3), where x1 represents minsup, and x2 representsminconf.3.2 Test regression resultsWe use multiple correlation coefficient to evaluate theeffects of regression analysis.y=βo+β(司)+B(司)+内()+a(司)“The multiple correlation cofficient is defined asE(v;-的)(4)E(9。-元)2 .y=B+β六+3后+(()+(三)(2)x1√x2Here, 0≤r≤1. The closer the r to 1, the smaller theresidual error is, and the higher the validity of the modely=Bo+Bx1 +2x+2+3(-)+Bu(=). (3)4 Experiment4.1 Experimental data and the test of mul-After we make simple variable transformation for thetiple correlation coefficientabove (1)-(3), the model of linear equations can be obtainedsWe use the equations shown in Section 3.1 and two kindsof data sets from University of California Irvine (UCI) ma-chine learning repository, which are nursery and letter. Wey= Bo+β1x1 + B2x2 + β3.L3 + β4x4.use the association rule mining algorithm to get the num-ber of rules in the conditions of different support thresholds,According to the rule of the least-squares method, solve thediferent confidence: thresholds and interest threshold of 1,which are shown in Tables 1 and 2.residual error.In order to minimizeTable 1 Data of nurserySupportConfidenceThe number of rules0.080.6S路=(yi-9)2=0.651=10.1062624yi-(3o + βrxi1 + B2xi2 + B3sxi3 + βxi))).70.15中国煤化工we calculate the partial derivatives for the coefficients and0.18MYHCNMHG_ 2.w. G. Yiet al. / Regression Analysis of the Number of Association Rules81Table2 Data of letterTable 4 F test of regression equationsSupportConfidenceThe number of rulesp testData0.080.6712Equation (1)Equation (2)Equation (3)0.65010.787nursery803.814168933.654183951 85054536letter54. 13195557.26366257 .639158).1564.3 Prediction of regression equation0.150.0055Using the obtained regression equation, we compared0. 18the predicted number of rules with the actual number inthe condition of given parameters, such as minsup andminconf. The results are shown in Table 5. .According to the equations, we adopted multiple correla-tion cofficient to evaluate the effect of regression analysis.Table 5 Prediction for the data of letterThe obtained multiple correlation cofficients are shown inTable 3.ActualPredicted Predicted Predictednumbernumber numberminsup minconf numberof rulesof rules of rulesTable 3 Multiple correlation cofficients (MCC)with (1)with (2) with (3)EquationMCC of nurseryMCC of letterO.2502642650.9992230.9886500.2021060.9993310.9892613)0.9993440.989330From the data in the table, we can see that (1) acquiredrelatively good prediction. However, the forecast value and4.2 Significance testactual value were biased. As long as the size of the de-This paper selected (1)- (3) in the experiment and theviation was in the extent that we can accept, the results .data in Tables 1 and 2 to do the signifcance test. With (3)were feasible. Therefore, after comprehensive analysis, weand the data in Table 1, we took the significance test as anobtained a good regression equation, ie, (3). Through theexample. With (3) and the data of nursery, the obtainedoptimal equation, we can predict the number of rules underthe given parameters. At the same time, under the premiseregression equation was as follows:of estimating the number of generated rules, we could lusey= - 455.438697 + 212.068966x1 + 274.367816x2+the obtained optimal equation to optimize the selection ofparameters.8.724138(=) + 125203448(1)4.4 Eficiency analysisWe did the significance test to regression equation and veri-This paper requires multiple sets of samples in the cal-fied whether the cofficients of parameterssignificantlyculation of regression equation cofficients, so how to elficiently obtain samples will be a key issue. In order toWe hypothesized: Ho:β1 = β2= β3=β4=0.efficiently obtain the sample, we adopted the following man-We calculatedner. First， we select the first group of samples, whichare minsup1, minconf1, and minIS1, and make param-9=22.2，路= S(y;)2- 10y2 = 2411.6,eter values of the samples less than the parameter val-ues of other samples. In this way, we just need to mine=1through some efficient algorithms, all the frequent item-sets and the generated set of rules under the given param-S在=2(gi- y)”= 2408.437165,eters (minsup1, minconf1, minIS1), and retain the cor-i=1responding rule sets and rules corresponding to the val-ues of support, confidence, and interest. Second, for any路= S另- S = 3.162835.set of samples, its parameters satisfy minsupi≥minsUp1,So,minconfi≥minconfi, and minISi≥minIS1. We do notneed to re-scan the database to determine frequent item-远sets and generated rules. Instead, we only need to find= 951.850545.the rules that meet the parameters of minsupi, minconfi,minIS; in the set of rules obtained from the first group ofN-M-1samples and retain them. So, we can quickly get the num-Let the significance level a = 0.05, and by the F distri-ber of rules under any set of samples without scanning thebution table Fo.95(4,5)= 5.19.database. Therefore, it can greatly enhance the efficiencyObviously, F = 951.850545 > Fo.95(4,5)= 5.19, so wein obtaining the parameters samples.refused Ho. It showed that the cofficients of regressionequation were not significantly zero, that is to say, all x1,5 Conclusionsx2, xr3, x4 had significant efects on y.We used the above mode to calculate the F test of theThis paper中国煤化工。parameters inother equations, and the results are shown in Table 4.the commonlyMHC N M H Gssociation rules..32International Journal of Automation and Computing 8(1), February 2011Under the support-confidence interestingness framework,[10] D. Q. Ye, S. L. Zhao. Correlation technique research of as-we have designed a number of regression equations. Ac-sociation rule based on linear regression. Journal of Com-cording to the obtained regression equations, we can pre-puter Research and Development, vol. 45, no. 21, pp. 291-294, 2008. (in Chinese)dict the number of generated rules in the condition of givenparameters and effectively solve the problem of the gener-[11] H. Tsukimoto. Logical regression analysis: from mathemat-ical formulas to linguistic rules. Studies in Fuzziness andation of a great number of rules when the parameters areSoft Computing, vol. 180, Pp. 21- 61, 2005.set unreasonably, and then achieve the goal of optimizingparameter settings. At the same time, when the parameters[12] Q. Liu, N. R. Cook, A. Bergstrom, C. C. Hsieh. A two-stageare set, we get the following conclusions: under the premisehierarchical regression model for meta-analysis of epidemi-ologic nonlinear dose-response data. Computational Statis-of estimating the number of generated rules, through thetics & Data Analysis, vol. 53, no. 12, PP. 4157- -4167, 2009.selected optimal equation, we estimate parameters accord-ing to the following order: minI S, minconf, which changes[13] R. C. Tsaur, H. F. Wang. Necessity analysis of fuzzy re-from large to small, minsup. Usually, minIS is set to 1,gression equations using a fuzzy goal programming model.International Journal of Fuzzy Systems， vol.11， no. 2,and minconf has priority to select a higher value, so thatPP. 107- 115, 2009.the obtained rules can be more meaningful. Therefore, bysetting parameters according to the above-mentioned order,[14] N. Duan, F. N. Hu, X. Yu. An improved control algorithmwe can get better parameter settings.for high-order nonlinear systems with unmodelled dynam-cs. International Journal of Automation and Computing,Future research goals include 1) regression equation asvol.6, no.3, PP. 234- -239, 2009.applied to the other data set, and its predicting effect, suchas application in nonlinear systeml4,15, and 2) testing the[15] X. Y. Luo, Z. H. Zhu, X. P. Guan. Adaptive fuzzy dy-changes in the relativity of rules under diferent time usingnamic surface control for uncertain nonlinear systems. In-the thought of non-linear regression equation.ternational Journal of Automation and Computing, vol.6,no.4, pp. 385- -390, 2009.Referencesl] R. Agrawal, T. Imielinski, A. Swami. Mining associationWei-Guo Yi graduated from Northeastrules between sets of items in large databases. In Proceed-Normal University, PRC in 2002. He reings of ACM SIGMOD International Conference on Man-ceived the M.Sc. degree from Northeastagement of Data, ACM, Washington DC, USA, PP. 207-Normal University in 2005. He is a lec-216, 1993.turer of Dalian Jiaotong University, PRC.Currently, he is a Ph.D. candidate at the[2] J. Wang, J. Han, J. Pei. Closet+: Searching for the bestSchool of Information Science and Technol-ogy, Dalian Maritime University, Dalian,noaotthe oth ACO SICKDDInternationgl Conter encoonKnowledge Discovery and Data Mining, ACM, WashingtonHis research interests include data min-DC, USA, pp.236- 245, 2003.ing and pattern recognition.[3] M. J. Zaki, C. Hsiao. Charm: An eficient algorithm forE-mail: jiekexun98@ 163.com (Corresponding author)closed itemset mining. In Proceedings of the 2nd SIAM In-ternational Conference on Data Mining, Arlington, USA,pp.12- 28, 2002.Ming-Yu Lu graduated from Hei-longjiang University of Computer Sof[4] M. Li. Application of mining association rules with multipleware, PRC in 1985. He received the M. Sc.minimum supports in sales data. Computer Engineering,degree in 1988 and the Ph.D. degree invol.23, no.8, PP. 9293, 2003. (in Chinese)2003，both from Tsinghua University,] U. Yun. An eficient mining of weighted frequent patternPRC. He is a senior member of the ChinaComputer Federation and professor owith length-decreasing support constraints. Knowledge-based Systems, vol.21, no.8, PP. 741-752, 2008.E. R. Omiecinski. Alternative interest measures for mininging, text mining, and Web mining.iations in Hatahases， TEEE Transactionson KnowledoeE- mail: lumingyu@tsinghua.org.cnand Data Engineering, vol. 15, no. 1, pp.57- 69, 2003.[7] s. Brin, R. Motwani, C. Silverstein. Beyond market baskets:Zhi Liu graduated from Dalian Univer-Generalizing association rules to correlations. In Proceed-sity, PRC in 1995. She received the M. Sc.degree from Dalian University of Technol-agement of Data, ACM, Tucson, USA, pp. 265- -276, 1997ogy, PRC in 1999. She is an associate pro-[8] K. M. Ahmed, N. M. E. Makky, Y. Taha. A note on be-fessor of Dalian Maritime University, PRC.Currently, she is a Ph. D. candidate at theyond market baskets: Generalizing association rules to cor-relations. ACM SIGKDD Explorations Newsletter, vol. 1,no.2, pP.46- -48, 2000.ogy, Dalian Maritime University.Her research interests include data min-[9] Z. He, H. K. Huang, S. F. Tian. An approach to findinging and artificial intelligence.optimized correlated association rules. Chinese Journal ofE-mail: lzsgmsc@ 126. comComputers, vol. 29, no. 6, pp. 906- 913, 2006. (in Chinese)中国煤化工MHCNMH G.

论文截图