Alan M. Frieze: Difference between revisions
'''Ordering points to identify the clustering structure''' ('''OPTICS''') is an algorithm for finding density-based [[Cluster analysis|clusters]] in spatial data. It was presented by Mihael Ankerst, Markus M. Breunig, [[Hans-Peter Kriegel]] and Jörg Sander.<ref>{{Cite conference
| author = Mihael Ankerst, Markus M. Breunig, [[Hans-Peter Kriegel]], Jörg Sander
| title = OPTICS: Ordering Points To Identify the Clustering Structure
| year = 1999
| pages = 49–60
| booktitle = ACM SIGMOD international conference on Management of data
| publisher = [[ACM Press]]
| url = http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.129.6542
}}</ref>
Its basic idea is similar to [[DBSCAN]],<ref>{{Cite conference
| author = Martin Ester, [[Hans-Peter Kriegel]], Jörg Sander, Xiaowei Xu
| title = A density-based algorithm for discovering clusters in large spatial databases with noise
| pages = 226–231
| editors = Evangelos Simoudis, Jiawei Han, Usama M. Fayyad
| booktitle = Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96)
| publisher = [[AAAI Press]]
| year = 1996
| isbn = 1-57735-004-9
| url = http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.1980
}}</ref> but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. To do so, the points of the database are (linearly) ordered such that points that are spatially closest become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that needs to be accepted for a cluster in order to have both points belong to the same cluster. This is represented as a [[dendrogram]].
==Basic idea==
Like [[DBSCAN]], OPTICS requires two parameters: <math>\varepsilon</math>, which describes the maximum distance (radius) to consider, and <math>MinPts</math>, which describes the number of points required to form a cluster. A point <math>p</math> is a ''core point'' if at least <math>MinPts</math> points are found within its <math>\varepsilon</math>-neighborhood <math>N_\varepsilon(p)</math>. Unlike [[DBSCAN]], OPTICS also considers points that are part of a more densely packed cluster, so each point is assigned a ''core distance'' that describes the distance to the <math>MinPts</math>th closest point:
:<math>\text{core-distance}_{\varepsilon,MinPts}(p)=\begin{cases}\text{UNDEFINED} & \text{if } |N_\varepsilon(p)| < MinPts\\ \text{distance to the } MinPts\text{-th closest point} & \text{otherwise}\end{cases}</math>
The ''reachability-distance'' of another point <math>o</math> from a point <math>p</math> is either the distance between <math>o</math> and <math>p</math> or the core distance of <math>p</math>, whichever is larger:
:<math>\text{reachability-distance}_{\varepsilon,MinPts}(o,p) = \begin{cases}\text{UNDEFINED} & \text{if } |N_\varepsilon(p)| < MinPts\\ \max(\text{core-distance}_{\varepsilon,MinPts}(p), \text{distance}(p,o)) & \text{otherwise}\end{cases}</math>
If <math>p</math> and <math>o</math> are nearest neighbors, this is the <math>\varepsilon' < \varepsilon</math> we need to assume in order to have <math>p</math> and <math>o</math> belong to the same cluster.
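The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the brute-force neighborhood search, the use of <code>None</code> for UNDEFINED, and the convention that a point counts as a member of its own neighborhood are choices made for this sketch, not part of the original description.

```python
import math

UNDEFINED = None  # assumption: None stands in for the UNDEFINED value


def neighbors(db, p, eps):
    """Indices of all points within eps of db[p] (assumed to include p itself)."""
    return [i for i, x in enumerate(db) if math.dist(db[p], x) <= eps]


def core_distance(db, p, eps, min_pts):
    """Distance from db[p] to its min_pts-th closest point, or UNDEFINED."""
    dists = sorted(math.dist(db[p], db[i]) for i in neighbors(db, p, eps))
    if len(dists) < min_pts:
        return UNDEFINED
    return dists[min_pts - 1]


def reachability_distance(db, o, p, eps, min_pts):
    """Reachability-distance of db[o] from db[p], or UNDEFINED if p is not a core point."""
    cd = core_distance(db, p, eps, min_pts)
    if cd is UNDEFINED:
        return UNDEFINED
    return max(cd, math.dist(db[p], db[o]))


db = [(0, 0), (1, 0), (2, 0), (10, 0)]
print(core_distance(db, 0, 3, 3))             # 2.0
print(reachability_distance(db, 1, 0, 3, 3))  # 2.0, the core distance dominates
print(core_distance(db, 3, 3, 3))             # None, the point is isolated
```

In the example, the point at index 1 lies at distance 1 from the point at index 0, but its reachability-distance from that point is 2 because the core distance of the latter is larger.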
Both the core-distance and the reachability-distance are undefined if no sufficiently dense cluster (w.r.t. <math>\varepsilon</math>) is available. Given a sufficiently large <math>\varepsilon</math>, this will never happen, but then every <math>\varepsilon</math>-neighborhood query will return the entire database, resulting in <math>O(n^2)</math> runtime. Hence, the <math>\varepsilon</math> parameter is required to cut off the density of clusters that is no longer considered to be interesting and to speed up the algorithm this way.
The parameter <math>\varepsilon</math> is, strictly speaking, not necessary. It can be set to a maximum value. When a spatial index is available, however, it does play a practical role when it comes to complexity. It is often claimed that OPTICS abstracts from DBSCAN by removing this parameter, at least to the extent that only a maximum value has to be given.
==Pseudocode==
The basic approach of OPTICS is similar to [[DBSCAN]], but instead of maintaining a set of known, but so far unprocessed cluster members, a [[priority queue]] (e.g. using an indexed [[Heap (data structure)|heap]]) is used.
 OPTICS(DB, eps, MinPts)
    for each point p of DB
       p.reachability-distance = UNDEFINED
    for each unprocessed point p of DB
       N = getNeighbors(p, eps)
       mark p as processed
       output p to the ordered list
       Seeds = empty priority queue
       if (core-distance(p, eps, MinPts) != UNDEFINED)
          update(N, p, Seeds, eps, MinPts)
          for each next q in Seeds
             N' = getNeighbors(q, eps)
             mark q as processed
             output q to the ordered list
             if (core-distance(q, eps, MinPts) != UNDEFINED)
                update(N', q, Seeds, eps, MinPts)
In update(), the priority queue Seeds is updated with the <math>\varepsilon</math>-neighborhood of <math>p</math> and <math>q</math>, respectively:
 update(N, p, Seeds, eps, MinPts)
    coredist = core-distance(p, eps, MinPts)
    for each o in N
       if (o is not processed)
          new-reach-dist = max(coredist, dist(p,o))
          if (o.reachability-distance == UNDEFINED) // o is not in Seeds
              o.reachability-distance = new-reach-dist
              Seeds.insert(o, new-reach-dist)
          else               // o in Seeds, check for improvement
              if (new-reach-dist < o.reachability-distance)
                 o.reachability-distance = new-reach-dist
                 Seeds.move-up(o, new-reach-dist)
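The pseudocode can be turned into a small runnable Python sketch. Several details here are assumptions made for illustration rather than part of the description above: neighborhoods are found by brute force instead of a spatial index, a point counts as a member of its own neighborhood, <code>None</code> stands for UNDEFINED, and the <code>move-up</code> operation is replaced by pushing a fresh entry onto a <code>heapq</code> heap and skipping stale entries when they are popped.

```python
import heapq
import math


def optics(db, eps, min_pts):
    """Return (order, reach): the processing order and each point's reachability."""
    n = len(db)
    reach = [None] * n          # None plays the role of UNDEFINED
    processed = [False] * n
    order = []

    def get_neighbors(p):
        # Brute-force eps-neighborhood query (includes p itself).
        return [i for i in range(n) if math.dist(db[p], db[i]) <= eps]

    def core_distance(p, nbrs):
        if len(nbrs) < min_pts:
            return None
        return sorted(math.dist(db[p], db[i]) for i in nbrs)[min_pts - 1]

    def update(nbrs, p, seeds):
        coredist = core_distance(p, nbrs)
        for o in nbrs:
            if processed[o]:
                continue
            new_reach = max(coredist, math.dist(db[p], db[o]))
            if reach[o] is None or new_reach < reach[o]:
                reach[o] = new_reach
                # Lazy replacement for move-up: push again, skip stale entries later.
                heapq.heappush(seeds, (new_reach, o))

    for p in range(n):
        if processed[p]:
            continue
        nbrs = get_neighbors(p)
        processed[p] = True
        order.append(p)
        seeds = []
        if core_distance(p, nbrs) is not None:
            update(nbrs, p, seeds)
            while seeds:
                _, q = heapq.heappop(seeds)
                if processed[q]:
                    continue  # stale heap entry
                nbrs_q = get_neighbors(q)
                processed[q] = True
                order.append(q)
                if core_distance(q, nbrs_q) is not None:
                    update(nbrs_q, q, seeds)
    return order, reach


points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0), (50, 0)]
order, reach = optics(points, eps=3, min_pts=2)
print(order)  # [0, 1, 2, 3, 4, 5, 6]
print(reach)  # [None, 1.0, 1.0, None, 1.0, 1.0, None]
```

The lazy-deletion heap keeps the sketch short at the price of extra heap entries; an indexed heap, as mentioned above, would avoid them.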
OPTICS hence outputs the points in a particular ordering, annotated with their smallest reachability distance (in the original algorithm, the core distance is also exported, but this is not required for further processing).
==Extracting the clusters==
[[Image:OPTICS.png|frame|right|Using a reachability plot to identify clusters]]
Using a ''reachability-plot'' (a special kind of [[dendrogram]]), the hierarchical structure of the clusters can be obtained easily. It is a 2D plot, with the ordering of the points on the x-axis and the reachability distance on the y-axis. Since points belonging to a cluster have a low reachability distance to their nearest neighbor, the clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.
The image on the right illustrates this concept. In its upper half, an artificial example of a database consisting of two-dimensional, spatial points is shown. The lower part shows the reachability plot as computed by OPTICS. The black lines link some clusters to their respective valleys. The horizontal red line is an example of how to obtain a clustering. Each valley it crosses is made a cluster of its own. If the line were moved down, more clusters would emerge, especially for the topmost cluster, which features varying densities.
Note that deriving clusters in such a way yields, for core points, the same result as running [[DBSCAN]] on the data with <math>\varepsilon</math> set to the chosen reachability-distance threshold. The assignment of non-core points to neighboring clusters is non-deterministic in DBSCAN, too.
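Cutting the reachability plot with a horizontal line can be sketched as follows. This is a simplified illustration, not the extraction procedure of the original paper: the function name, the input format (a processing order plus a list of reachability values with <code>None</code> for UNDEFINED) and the rule that a valley containing a single point is reported as noise are assumptions of this sketch.

```python
def extract_clusters(order, reach, threshold):
    """Cut the reachability plot at `threshold`.

    Returns a dict mapping each point to a cluster label, with -1 meaning noise.
    """
    groups = []
    for p in order:
        r = reach[p]
        if r is None or r > threshold or not groups:
            groups.append([p])    # a spike: p may start a new valley
        else:
            groups[-1].append(p)  # p is still inside the current valley

    labels = {}
    next_label = 0
    for group in groups:
        if len(group) == 1:
            labels[group[0]] = -1  # assumption: a lone point is treated as noise
        else:
            for p in group:
                labels[p] = next_label
            next_label += 1
    return labels


order = [0, 1, 2, 3, 4, 5, 6]
reach = [None, 1.0, 1.0, None, 1.0, 1.0, None]
print(extract_clusters(order, reach, threshold=1.5))
# {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: -1}
```

Lowering <code>threshold</code> below 1.0 in this example turns every point into noise, which corresponds to moving the horizontal line below the floor of both valleys.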
The blue points in this image are considered noise, and no valley is found in their reachability plot. This is subject to the <math>\varepsilon</math> parameter, which bounds the density of clusters.
A more advanced analysis does not use a specific value of <math>\varepsilon</math>, but instead looks for spikes that separate clusters. This can be used to obtain a [[hierarchical clustering]] that cannot be achieved by a single DBSCAN run.
==Complexity==
Like [[DBSCAN]], OPTICS processes each point once, and performs one [[fixed-radius near neighbors|<math>\varepsilon</math>-neighborhood query]] during this processing. Given a [[spatial index]] that grants a neighborhood query in <math>O(\log n)</math> runtime, an overall runtime of <math>O(n \cdot \log n)</math> is obtained. The authors of the original OPTICS paper report an actual constant slowdown factor of 1.6 compared to DBSCAN. Note that the value of <math>\varepsilon</math> might heavily influence the cost of the algorithm, since a value that is too large might raise the cost of a neighborhood query to linear complexity.
In particular, choosing <math>\varepsilon > \max_{x,y} d(x,y)</math> (larger than the maximum distance in the data set) is possible, but leads to quadratic complexity, since every neighborhood query will return the full data set. Even when no spatial index is available, this comes at additional cost in managing the heap. Therefore, <math>\varepsilon</math> should be chosen appropriately for the data set.
==Extensions==
OPTICS-OF<ref>{{Cite book
| author = Markus M. Breunig, [[Hans-Peter Kriegel]], Raymond T. Ng and Jörg Sander
| chapter = OPTICS-OF: Identifying Local Outliers
| pages = 262–270
| title = Principles of Data Mining and Knowledge Discovery
| publisher = [[Springer-Verlag]]
| year = 1999
| isbn = 978-3-540-66490-1
| doi = 10.1007/b72280
| chapterurl = http://springerlink.metapress.com/content/76bx6413gqb4tvta/
}}</ref> is an [[anomaly detection|outlier detection]] algorithm based on OPTICS. Its main use is the extraction of outliers from an existing run of OPTICS at low cost compared to using a different outlier detection method.
DeLi-Clu<ref>{{cite doi | 10.1007/11731139_16}}</ref> (Density-Link-Clustering) combines ideas from [[single-linkage clustering]] and OPTICS, eliminating the <math>\varepsilon</math> parameter and offering performance improvements over OPTICS.
HiSC<ref>{{cite doi | 10.1007/11871637_42}}</ref> is a hierarchical [[subspace clustering]] (axis-parallel) method based on OPTICS.
HiCO<ref>{{cite doi | 10.1109/SSDBM.2006.35}}</ref> is a hierarchical [[correlation clustering]] algorithm based on OPTICS.
DiSH<ref>{{cite doi | 10.1007/978-3-540-71703-4_15}}</ref> is an improvement over HiSC that can find more complex hierarchies.
==Availability==
Implementations of OPTICS, OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH are available in the [[Environment for DeveLoping KDD-Applications Supported by Index-Structures|ELKI data mining framework]] (with index acceleration). An incomplete and slow implementation can be found in the [[Weka (machine learning)|Weka]] extensions. The MRC National Institute for Medical Research provides a [http://mathbio.nimr.mrc.ac.uk/wiki/Software#OPTICS C reimplementation of OPTICS] without index support.
==References==
{{reflist}}
[[Category:Data clustering algorithms]]
Revision as of 04:56, 30 October 2013