INFORMS Journal on Computing
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


INFORMS JOURNAL ON COMPUTING
Vol. 20, No. 4, Fall 2008, pp. 539-552
DOI: 10.1287/ijoc.1080.0265
This Article
Right arrow Full Text (PDF)
Right arrow References
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Xiong, H.
Right arrow Articles by Ma, S.
Right arrow Search for Related Content

Top-k {phi} Correlation Computation

Hui Xiong, Wenjun Zhou, Mark Brodie, Sheng Ma

Management Science and Information Systems Department, Rutgers Business School, Rutgers University, Newark, New Jersey 07102
Management Science and Information Systems Department, Rutgers Business School, Rutgers University, Newark, New Jersey 07102
IBM T.J. Watson Research Center, Hawthorne, New York 10532
DoubleClick Inc., New York, New York 10011

hxiong{at}rutgers.edu
wjzhou{at}pegasus.rutgers.edu
mbrodie{at}us.ibm.com
shengma2005{at}gmail.com

Recently, there has been considerable interest in efficiently computing strongly correlated pairs in large databases. Most previous studies require the specification of a minimum correlation threshold to perform the computation. However, it may be difficult for users to provide an appropriate threshold in practice because different data sets typically have different characteristics. To this end, in this paper, we propose an alternative task: finding the top-k strongly correlated pairs. Consequently, we identify a two-dimensional monotone property of an upper bound of {phi} correlation coefficient and develop an efficient algorithm, called TOP-COP, to exploit this property to effectively prune many pairs even without computing their correlation coefficients. Our experimental results show that TOP-COP can be an order of magnitude faster than alternative approaches for mining the top-k strongly correlated pairs. Finally, we show that the performance of the TOP-COP algorithm is tightly related to the degree of data dispersion. Indeed, the higher the degree of data dispersion, the larger the computational savings achieved by the TOP-COP algorithm.

Key words: {phi} correlation coefficient; Pearson's correlation coefficient; statistical computing
History: received December 2006; accepted December 2007.







HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
Copyright © 2008 by INFORMS.