Nagiza F. Samatova (1), George Ostrouchov (1), Al Geist (1) and Anatoli V. Melechko (2)
(1) Computer Science and Mathematics Division, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831, USA
(2) Oak Ridge National Laboratory, Molecular-Scale Engineering and Nanoscale Technologies Group, P.O. Box 2008, Oak Ridge, TN 37831, USA
Abstract
This paper presents a hierarchical clustering method named RACHET (Recursive Agglomeration of Clustering Hierarchies by Encircling Tactic) for analyzing multi-dimensional distributed data. A typical clustering algorithm requires bringing all the data into a centralized warehouse. This results in O(nd) transmission cost, where n is the number of data points and d is the number of dimensions. For large datasets, this is prohibitively expensive. In contrast, RACHET runs with at most O(n) time, space, and communication costs to build a global hierarchy of comparable clustering quality by merging locally generated clustering hierarchies. RACHET employs the encircling tactic, in which the merges at each stage are chosen so as to minimize the volume of a covering hypersphere. For each cluster centroid, RACHET maintains descriptive statistics of constant complexity to enable these choices. RACHET's framework is applicable to a wide class of centroid-based hierarchical clustering algorithms, such as centroid, medoid, and Ward.
Keywords
clustering distributed datasets - distributed data mining
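The encircling tactic mentioned in the abstract, choosing at each stage the merge that minimizes the volume of a covering hypersphere, can be illustrated with a minimal sketch. The abstract does not say how RACHET computes the covering hypersphere or which statistics it stores, so everything below is an assumption made for illustration: each cluster is summarized as a (center, radius) pair, the helper names `covering_sphere` and `best_merge` are hypothetical, and the brute-force scan over all pairs is a stand-in rather than the procedure used in the paper. For a fixed number of dimensions the volume of a hypersphere grows monotonically with its radius, so the sketch compares radii instead of volumes.

```python
import math
from itertools import combinations


def covering_sphere(c1, r1, c2, r2):
    """Smallest hypersphere enclosing two hyperspheres given as (center, radius).

    Illustrative geometry only; not taken from the RACHET paper.
    """
    dist = math.dist(c1, c2)
    # If one sphere already contains the other, it is the covering sphere.
    if dist + r2 <= r1:
        return list(c1), r1
    if dist + r1 <= r2:
        return list(c2), r2
    # Otherwise the covering sphere spans the far edge of each sphere.
    radius = (dist + r1 + r2) / 2.0
    # Its center lies on the segment c1 -> c2, at distance (radius - r1) from c1.
    t = (radius - r1) / dist
    center = [a + t * (b - a) for a, b in zip(c1, c2)]
    return center, radius


def best_merge(clusters):
    """Return (i, j, radius) for the pair with the smallest covering sphere.

    `clusters` is a list of (center, radius) summaries (an assumed format).
    """
    best = None
    for i, j in combinations(range(len(clusters)), 2):
        (ci, ri), (cj, rj) = clusters[i], clusters[j]
        _, radius = covering_sphere(ci, ri, cj, rj)
        if best is None or radius < best[0]:
            best = (radius, i, j)
    return best[1], best[2], best[0]


if __name__ == "__main__":
    summaries = [((0.0, 0.0), 1.0), ((3.0, 0.0), 1.0), ((10.0, 0.0), 1.0)]
    print(best_merge(summaries))
```

Each summary here has a size that does not depend on the number of points in the cluster, which mirrors the abstract's point that constant-complexity statistics per centroid are enough to make the merge choice without shipping the raw data.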