'''Density-based spatial clustering of applications with noise''' ('''DBSCAN''') is a [[data clustering]] algorithm proposed by Martin Ester, [[Hans-Peter Kriegel]], Jörg Sander and Xiaowei Xu in 1996.<ref>{{Cite conference
| author = Martin Ester, [[Hans-Peter Kriegel]], Jörg Sander, Xiaowei Xu
| title = A density-based algorithm for discovering clusters in large spatial databases with noise
| pages = 226–231
| editors = Evangelos Simoudis, Jiawei Han, Usama M. Fayyad
| booktitle = Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96)
| publisher = [[AAAI Press]]
| year = 1996
| isbn = 1-57735-004-9
| url = http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.1980
}}</ref>

It is a [[Cluster_analysis#Density-based_clustering|density-based clustering]] algorithm because it finds a number of clusters starting from the estimated density distribution of the corresponding nodes. DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature.<ref>[http://academic.research.microsoft.com/CSDirectory/paper_category_7.htm Most cited data mining articles according to Microsoft Academic Search]; DBSCAN is at rank 24 (accessed 2010-04-18).</ref> [[OPTICS algorithm|OPTICS]] can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the <math>\varepsilon</math> parameter with a maximum search radius.

==Basic idea==

[[File:DBSCAN-Illustration.svg|thumb|400px|Points at A are core points. Points B and C are ''density-reachable'' from A and thus ''density-connected'' and belong to the same cluster. Point N is a noise point that is neither a core point nor density-reachable. (MinPts=3 or MinPts=4)]]

DBSCAN's definition of a cluster is based on the notion of ''density reachability''. Basically, a point <math>q</math> is ''directly density-reachable'' from a point <math>p</math> if it is not farther away than a given distance <math>\varepsilon</math> (i.e., it is part of <math>p</math>'s <math>\varepsilon</math>-neighborhood) and if <math>p</math> is surrounded by sufficiently many points that one may consider <math>p</math> and <math>q</math> to be part of a cluster. <math>q</math> is called ''density-reachable'' (note the distinction from "''directly'' density-reachable") from <math>p</math> if there is a sequence <math>p_1,\ldots,p_n</math> of points with <math>p_1 = p</math> and <math>p_n = q</math>, where each <math>p_{i+1}</math> is directly density-reachable from <math>p_i</math>.
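To make the definition concrete, here is a minimal Python sketch of the direct density-reachability test (an illustration only; the 2-D tuples, Euclidean distance, and function names are assumptions, not part of the original definition):

```python
import math

def eps_neighborhood(points, p, eps):
    """All points of the database within distance eps of p (including p itself)."""
    return [q for q in points if math.dist(p, q) <= eps]

def directly_density_reachable(points, p, q, eps, min_pts):
    """q is directly density-reachable from p iff q lies in p's
    eps-neighborhood and that neighborhood contains at least min_pts
    points (i.e., p is a core point)."""
    neighborhood = eps_neighborhood(points, p, eps)
    return q in neighborhood and len(neighborhood) >= min_pts
```

Note the asymmetry discussed below: with the points (0,0), (0,1), (1,0), (2,0) and eps = 1, minPts = 3, the point (2,0) is directly density-reachable from (1,0), but not vice versa, because (2,0) has only two points in its ε-neighborhood.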

Note that the relation of density-reachability is not symmetric: <math>q</math> might lie on the edge of a cluster, having too few neighbors to count as dense itself. A path search started from <math>q</math> would therefore stop immediately, whereas starting the process from <math>p</math> ''would'' lead to <math>q</math> (though the process would halt there, <math>q</math> being the first non-dense point). Because of this asymmetry, the notion of ''density-connected'' is introduced: two points <math>p</math> and <math>q</math> are density-connected if there is a point <math>o</math> such that both <math>p</math> and <math>q</math> are density-reachable from <math>o</math>. Density-connectedness ''is'' symmetric.

A cluster, which is a subset of the points of the database, satisfies two properties:
# All points within the cluster are mutually density-connected.
# If a point is density-connected to any point of the cluster, it is part of the cluster as well.

==Algorithm==

DBSCAN requires two parameters: <math>\varepsilon</math> (eps) and the minimum number of points required to form a cluster (minPts). It starts at an arbitrary point that has not been visited. This point's <math>\varepsilon</math>-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized <math>\varepsilon</math>-neighborhood of a different point and hence be made part of a cluster.

If a point is found to be a dense part of a cluster, its <math>\varepsilon</math>-neighborhood is also part of that cluster. Hence, all points found within that <math>\varepsilon</math>-neighborhood are added, as is their own <math>\varepsilon</math>-neighborhood when they are also dense. This process continues until the density-connected cluster is completely found. Then a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise.

==Pseudocode==

 DBSCAN(D, eps, MinPts)
    C = 0
    for each unvisited point P in dataset D
       mark P as visited
       NeighborPts = regionQuery(P, eps)
       if sizeof(NeighborPts) < MinPts
          mark P as NOISE
       else
          C = next cluster
          expandCluster(P, NeighborPts, C, eps, MinPts)
 
 expandCluster(P, NeighborPts, C, eps, MinPts)
    add P to cluster C
    for each point P' in NeighborPts
       if P' is not visited
          mark P' as visited
          NeighborPts' = regionQuery(P', eps)
          if sizeof(NeighborPts') >= MinPts
             NeighborPts = NeighborPts joined with NeighborPts'
       if P' is not yet member of any cluster
          add P' to cluster C
 
 regionQuery(P, eps)
    return all points within P's eps-neighborhood (including P)
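For illustration, the pseudocode can be transcribed into a minimal, brute-force Python sketch (the function names, 2-D point tuples, and Euclidean regionQuery are assumptions for this example, not part of the original pseudocode):

```python
import math

NOISE = -1       # cluster label used for noise points
UNVISITED = None

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [UNVISITED] * len(points)
    c = -1  # current cluster id
    for p in range(len(points)):
        if labels[p] is not UNVISITED:
            continue  # already visited
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE  # may later be claimed as a border point
            continue
        c += 1           # C = next cluster
        labels[p] = c
        seeds = list(neighbors)  # expandCluster: grow the seed list in place
        k = 0
        while k < len(seeds):
            q = seeds[k]
            if labels[q] is UNVISITED:
                labels[q] = c  # mark visited and assign to cluster C
                q_neighbors = region_query(points, q, eps)
                if len(q_neighbors) >= min_pts:  # q is a core point too
                    seeds.extend(q_neighbors)
            elif labels[q] == NOISE:
                labels[q] = c  # border point: joins the cluster, no expansion
            k += 1
    return labels
```

For example, with two well-separated groups of four points each plus one far-away point, the two groups come out as clusters 0 and 1 and the isolated point is labeled -1.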

==Complexity==

DBSCAN visits each point of the database, possibly multiple times (e.g., as a candidate for different clusters). For practical considerations, however, the time complexity is mostly governed by the number of regionQuery invocations. DBSCAN executes exactly one such query for each point, and if an [[spatial_index|indexing structure]] is used that executes such a [[fixed-radius near neighbors|neighborhood query]] in <math>O(\log n)</math>, an overall runtime complexity of <math>O(n \log n)</math> is obtained. Without an accelerating index structure, the runtime complexity is <math>O(n^2)</math>. Often the distance matrix of size <math>(n^2-n)/2</math> is materialized to avoid distance recomputations; this, however, also needs <math>O(n^2)</math> memory.

[[File:DBSCAN-density-data.svg|thumb|DBSCAN can find non-linearly separable clusters. This dataset cannot be adequately clustered with k-means or Gaussian mixture EM clustering.]]

==Advantages==

# DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to [[K-means algorithm|k-means]].
# DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
# DBSCAN has a notion of noise.
# DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.)
# DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an [[R* tree]].

==Disadvantages==

# The quality of DBSCAN depends on the [[Metric (mathematics)|distance measure]] used in the function regionQuery(P, <math>\varepsilon</math>). The most commonly used distance metric is [[Euclidean distance]]. Especially for [[clustering high-dimensional data|high-dimensional data]], this metric can be rendered almost useless by the so-called "[[curse of dimensionality]]", making it difficult to find an appropriate value for <math>\varepsilon</math>. This effect, however, is present in any other algorithm based on Euclidean distance as well.
# DBSCAN cannot cluster data sets with large differences in densities well, since the minPts-<math>\varepsilon</math> combination cannot then be chosen appropriately for all clusters.

See the section below on extensions for algorithmic modifications to handle these issues.

==Parameter estimation==

Every data mining task has the problem of parameters: every parameter influences the algorithm in a specific way. DBSCAN needs the two parameters ε and ''minPts'', which must be specified by the user.

* ''MinPts'': As a rule of thumb, a minimum ''minPts'' can be derived from the number of dimensions ''D'' in the data set, as ''minPts'' ≥ ''D'' + 1. The low value ''minPts'' = 1 does not make sense, as then every point on its own will already form a cluster. With ''minPts'' = 2, the result will be the same as that of [[hierarchical clustering]] with the single-link metric, with the dendrogram cut at height ε (however, DBSCAN is substantially faster, because it does not compute the full dendrogram and can use indexes). Larger values are usually better for data sets with noise and will yield more significant clusters; the larger the data set, the larger ''minPts'' should be chosen.
* ε: The value for ε can then be chosen by using a [[k-distance graph]], plotting the distance to the ''k'' = ''minPts'' nearest neighbor. Good values of ε are where this plot shows a strong bend: if ε is chosen too small, a large part of the data will not be clustered; whereas for a too high value of ε, clusters will merge and the majority of objects will be in the same cluster.
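The k-distance heuristic can be sketched in Python as follows (illustrative only; the helper name and the brute-force distance computation are assumptions):

```python
import math

def sorted_k_distances(points, k):
    """For every point, the distance to its k-th nearest neighbor
    (excluding the point itself), sorted in descending order. The 'elbow'
    of this curve is a candidate value for eps when k = minPts."""
    kd = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        kd.append(dists[k - 1])
    return sorted(kd, reverse=True)
```

Plotting the returned list (index on the x-axis, distance on the y-axis) gives the k-distance graph described above.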

[[OPTICS algorithm|OPTICS]] can be seen as a generalization of DBSCAN that replaces the <math>\varepsilon</math> parameter with a maximum value that mostly affects performance. ''MinPts'' then essentially becomes the minimum cluster size to find. While OPTICS is much easier to parameterize than DBSCAN, its results are a bit more difficult to use, as it will usually produce a hierarchical clustering instead of the simple data partitioning that DBSCAN produces.

Recently, one of the original authors of DBSCAN has revisited DBSCAN and OPTICS and published a refined version of hierarchical DBSCAN (HDBSCAN*),<ref name="hdbscan1">{{cite doi | 10.1007/978-3-642-37456-2_14 }}</ref><ref name="hdbscan2">{{cite doi | 10.1007/s10618-013-0311-4}}</ref> which no longer has the notion of border points.

==Generalization==

Generalized DBSCAN ('''GDBSCAN''')<ref>{{Cite journal
| first1 = Jörg | last1 = Sander
| first2 = Martin | last2 = Ester
| first3 = Hans-Peter | last3 = Kriegel | authorlink3 = Hans-Peter Kriegel
| first4 = Xiaowei | last4 = Xu
| title = Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications
| journal = Data Mining and Knowledge Discovery
| volume = 2
| issue = 2
| pages = 169–194
| publisher = [[Springer-Verlag]]
| place = Berlin
| year = 1998
| doi = 10.1023/A:1009745219419
| url = http://www.springerlink.com/content/n22065n21n1574k6
}}</ref><ref>{{Cite book | first = Jörg | last = Sander | title = Generalized Density-Based Clustering for Spatial Data Mining | isbn = 3-89675-469-6 | year = 1998 | place = München | publisher = Herbert Utz Verlag }}</ref> is a generalization by the same authors to arbitrary "neighborhood" and "dense" predicates. The <math>\varepsilon</math> and ''minPts'' parameters are removed from the original algorithm and moved to the predicates. For example, on polygon data the "neighborhood" could be any intersecting polygon, while the density predicate uses the polygon areas instead of just the object count.

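The idea of moving the parameters into predicates can be illustrated with a hypothetical sketch (an assumption for illustration, not the published GDBSCAN implementation): the algorithm takes a neighborhood function and a density predicate instead of ε and minPts:

```python
def gdbscan(objects, neighborhood, is_dense):
    """Generic density-based clustering: neighborhood(o) returns the
    neighbors of o (including o itself), and is_dense(neighbors) decides
    whether that neighborhood satisfies the density predicate."""
    labels = {}  # object -> cluster id; -1 marks noise
    c = -1
    for o in objects:
        if o in labels:
            continue  # already visited
        seeds = list(neighborhood(o))
        if not is_dense(seeds):
            labels[o] = -1  # noise; may later join a cluster as border object
            continue
        c += 1
        labels[o] = c
        k = 0
        while k < len(seeds):
            q = seeds[k]
            if q not in labels or labels[q] == -1:
                labels[q] = c
                q_neighbors = neighborhood(q)
                if is_dense(q_neighbors):  # expand only from dense objects
                    seeds.extend(q_neighbors)
            k += 1
    return labels
```

Classic DBSCAN is recovered by choosing the ε-range query as the neighborhood function and `len(neighbors) >= minPts` as the density predicate; for polygon data, the neighborhood could instead return intersecting polygons and the predicate could sum their areas.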

==Extensions==

Various extensions to the DBSCAN algorithm have been proposed, including methods for parallelization, parameter estimation, and support for uncertain data. The basic idea has been extended to hierarchical clustering by the [[OPTICS algorithm]]. DBSCAN is also used as part of subspace clustering algorithms like [[PreDeCon]] and [[SUBCLU]]. HDBSCAN<ref name="hdbscan1" /> is a revisited version of DBSCAN and OPTICS, from which informative clusters can be extracted using heuristics.<ref name="hdbscan2" />

==Availability==

An implementation of DBSCAN is available in the [[ELKI]] framework. Note that this implementation is optimized for extensibility rather than speed: it can use various index structures for sub-quadratic performance and supports various distance functions and arbitrary data types, but it may be outperformed by low-level optimized implementations on small data sets.

[[scikit-learn]] includes a Python implementation of DBSCAN for arbitrary [[Minkowski distance|Minkowski metrics]], based on [[kd-tree]]s and [[ball tree]]s.
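A usage sketch for the scikit-learn implementation (this assumes scikit-learn is installed; the eps and min_samples values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [50, 50]], dtype=float)

clustering = DBSCAN(eps=1.5, min_samples=3).fit(X)
labels = clustering.labels_  # one cluster id per sample; noise is labeled -1
```

The `metric` parameter of `DBSCAN` selects the Minkowski (or other) metric mentioned above.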

[[R (programming language)|GNU R]] contains DBSCAN in the "fpc" package, with support for arbitrary distance functions via distance matrices. However, it does not have index support (and thus has quadratic runtime complexity).

<!-- before adding yet another implementation, verify that it is of notability. Wikipedia is not a collection of links! -->

==References==
{{Reflist}}

===Further reading===
*{{cite conference
| author = Domenica Arlia, Massimo Coppola
| title = Experiments in Parallel Clustering with DBSCAN
| booktitle = Euro-Par 2001: Parallel Processing: 7th International Euro-Par Conference, Manchester, UK, August 28–31, 2001, Proceedings
| publisher = Springer Berlin
}}
*{{Cite journal
| author = [[Hans-Peter Kriegel]], Peer Kröger, Jörg Sander, Arthur Zimek
| title = Density-based Clustering
| journal = WIREs Data Mining and Knowledge Discovery
| volume = 1
| issue = 3
| year = 2011
| pages = 231–240
| url = http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WIDM30.html
| doi = 10.1002/widm.30
}}

{{DEFAULTSORT:Dbscan}}
[[Category:Data clustering algorithms]]