In [[computer science]], '''Hirschberg's algorithm''', named after its inventor, [[Dan Hirschberg]], is a [[dynamic programming]] [[algorithm]] that finds the optimal [[sequence alignment]] between two [[string (computer science)|string]]s. Optimality is measured with the [[Levenshtein distance]], defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm can be described as a [[divide and conquer algorithm|divide and conquer]] version of the [[Needleman–Wunsch algorithm]].<ref>[http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Hirsch/ Hirschberg's algorithm]</ref> Hirschberg's algorithm is commonly used in [[computational biology]] to find maximal global alignments of [[DNA]] and [[protein]] sequences.
==Algorithm information==
Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment. [[BLAST]] and [[FASTA]] are suboptimal [[Heuristic (computer science)|heuristics]]. If ''x'' and ''y'' are strings, where length(''x'') = ''n'' and length(''y'') = ''m'', the [[Needleman–Wunsch algorithm]] finds an optimal alignment in [[Big O notation|O]](''nm'') time, using O(''nm'') space. Hirschberg's algorithm is a clever modification of the Needleman–Wunsch algorithm which still takes O(''nm'') time, but needs only O(min{''n'',''m''}) space.<ref>http://www.cs.tau.ac.il/~rshamir/algmb/98/scribe/html/lec02/node10.html</ref>

One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the [[longest common subsequence problem|longest common subsequence]] between two sets of data, as is done by the common [[diff]] tool.

The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that:<ref>{{cite journal|author=Hirschberg, D. S.|title=A linear space algorithm for computing maximal common subsequences|journal=Communications of the ACM|volume=18|issue=6|year=1975|pages=341–343|doi=10.1145/360825.360861}}</ref>

# one can compute the optimal alignment score by only storing the current and previous row of the Needleman–Wunsch score matrix;
# if <math>(Z,W) = \operatorname{NW}(X,Y)</math> is the optimal alignment of <math>(X,Y)</math>, and <math>X = X^l + X^r</math> is an arbitrary partition of <math>X</math>, there exists a partition <math>Y^l + Y^r</math> of <math>Y</math> such that <math>\operatorname{NW}(X,Y) = \operatorname{NW}(X^l,Y^l) + \operatorname{NW}(X^r,Y^r)</math>.
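In particular, observation (2) implies that for any fixed split <math>X = X^l + X^r</math>, an optimal split point of <math>Y</math> can be found by maximizing the sum of a forward score over the left halves and a backward score over the reversed right halves. In the notation of the functions <math>\operatorname{NWScore}</math> and <math>\operatorname{rev}</math> defined below, this split point is

<math>
\mathrm{ymid} = \underset{0 \leqslant j \leqslant \operatorname{length}(Y)}{\arg\max} \left( \operatorname{NWScore}(X^l, Y)_j + \operatorname{NWScore}(\operatorname{rev}(X^r), \operatorname{rev}(Y))_{\operatorname{length}(Y)-j} \right),
</math>

which is exactly the quantity computed by the function <math>\mathrm{PartitionY}</math> in the algorithm below.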
== Algorithm description ==

<math>X_i</math> denotes the ''i''-th character of <math>X</math>, where <math>1 \leqslant i \leqslant \operatorname{length}(X)</math>. <math>X_{i:j}</math> denotes the substring of size <math>j-i+1</math>, ranging from the ''i''-th to the ''j''-th character of <math>X</math>. <math>\operatorname{rev}(X)</math> is the reversed version of <math>X</math>.

<math>X</math> and <math>Y</math> are the sequences to be aligned. Let <math>x</math> be a character from <math>X</math>, and <math>y</math> be a character from <math>Y</math>. We assume that <math>\operatorname{Del}(x)</math>, <math>\operatorname{Ins}(y)</math> and <math>\operatorname{Sub}(x,y)</math> are well-defined integer-valued functions. These functions represent the cost of deleting <math>x</math>, inserting <math>y</math>, and replacing <math>x</math> with <math>y</math>, respectively.
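For illustration, these cost functions could be ordinary integer-valued Python functions; a minimal sketch with hypothetical names, using the concrete values from the example section below:

 def del_cost(x):
     return -2                    # cost of deleting character x from X
 def ins_cost(y):
     return -2                    # cost of inserting character y from Y
 def sub_cost(x, y):
     return 2 if x == y else -1   # cost of substituting x with y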
We define <math>\operatorname{NWScore}(X,Y)</math>, which returns the last line of the Needleman–Wunsch score matrix <math>\mathrm{Score}(i,j)</math>:

 '''function''' NWScore(X,Y)
     Score(0,0) = 0
     '''for''' j=1 '''to''' length(Y)
         Score(0,j) = Score(0,j-1) + Ins(Y<sub>j</sub>)
     '''for''' i=1 '''to''' length(X)
         Score(i,0) = Score(i-1,0) + Del(X<sub>i</sub>)
         '''for''' j=1 '''to''' length(Y)
             scoreSub = Score(i-1,j-1) + Sub(X<sub>i</sub>, Y<sub>j</sub>)
             scoreDel = Score(i-1,j) + Del(X<sub>i</sub>)
             scoreIns = Score(i,j-1) + Ins(Y<sub>j</sub>)
             Score(i,j) = max(scoreSub, scoreDel, scoreIns)
         '''end'''
     '''end'''
     '''for''' j=0 '''to''' length(Y)
         LastLine(j) = Score(length(X),j)
     '''return''' LastLine
Note that at any point, <math>\operatorname{NWScore}</math> only requires the two most recent rows of the score matrix. Thus, <math>\operatorname{NWScore}</math> can be implemented in <math>O(\operatorname{min}\{\operatorname{length}(X),\operatorname{length}(Y)\})</math> space.
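For illustration, a direct Python transcription of <math>\operatorname{NWScore}</math> that keeps only the previous and current row might look as follows (a sketch assuming the hypothetical cost functions <code>del_cost</code>, <code>ins_cost</code> and <code>sub_cost</code> from above; the two stored rows have length(''Y'') + 1 entries each):

 def nw_score(X, Y):
     """Return the last row of the Needleman-Wunsch score matrix for X and Y."""
     prev = [0] * (len(Y) + 1)
     for j in range(1, len(Y) + 1):
         prev[j] = prev[j - 1] + ins_cost(Y[j - 1])
     for i in range(1, len(X) + 1):
         curr = [0] * (len(Y) + 1)
         curr[0] = prev[0] + del_cost(X[i - 1])
         for j in range(1, len(Y) + 1):
             curr[j] = max(prev[j - 1] + sub_cost(X[i - 1], Y[j - 1]),  # substitution
                           prev[j] + del_cost(X[i - 1]),                # deletion
                           curr[j - 1] + ins_cost(Y[j - 1]))            # insertion
         prev = curr
     return prev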
The Hirschberg algorithm follows:

 '''function''' Hirschberg(X,Y)
     Z = ""
     W = ""
     '''if''' length(X) == 0 '''or''' length(Y) == 0
         '''if''' length(X) == 0
             '''for''' i=1 '''to''' length(Y)
                 Z = Z + '-'
                 W = W + Y<sub>i</sub>
             '''end'''
         '''else if''' length(Y) == 0
             '''for''' i=1 '''to''' length(X)
                 Z = Z + X<sub>i</sub>
                 W = W + '-'
             '''end'''
         '''end'''
     '''else if''' length(X) == 1 '''or''' length(Y) == 1
         (Z,W) = NeedlemanWunsch(X,Y)
     '''else'''
         xlen = length(X)
         xmid = length(X)/2
         ylen = length(Y)
 
         ScoreL = NWScore(X<sub>1:xmid</sub>, Y)
         ScoreR = NWScore(rev(X<sub>xmid+1:xlen</sub>), rev(Y))
         ymid = PartitionY(ScoreL, ScoreR)
 
         (Z,W) = Hirschberg(X<sub>1:xmid</sub>, Y<sub>1:ymid</sub>) + Hirschberg(X<sub>xmid+1:xlen</sub>, Y<sub>ymid+1:ylen</sub>)
     '''end'''
     '''return''' (Z,W)

In the context of Observation (2), assume that <math>X^l + X^r</math> is a partition of <math>X</math>. Function <math>\mathrm{PartitionY}</math> returns index <math>\mathrm{ymid}</math> such that <math>Y^l = Y_{1:\mathrm{ymid}}</math> and <math>Y^r = Y_{\mathrm{ymid}+1:\operatorname{length}(Y)}</math>. <math>\mathrm{PartitionY}</math> is given by

 '''function''' PartitionY(ScoreL, ScoreR)
     '''return''' [[arg max]] ScoreL + rev(ScoreR)
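For illustration, a minimal Python transcription of the two functions above (a sketch reusing the hypothetical <code>nw_score</code> and cost functions from the earlier snippets; <code>needleman_wunsch</code> is a small full-matrix helper, with traceback, for the one-character base case; as in the pseudocode, <code>Z</code> is aligned to <code>X</code> and <code>W</code> to <code>Y</code>):

 def partition_y(score_l, score_r):
     """Return the index ymid maximizing ScoreL + rev(ScoreR)."""
     rev_r = score_r[::-1]
     return max(range(len(score_l)), key=lambda j: score_l[j] + rev_r[j])
 
 def needleman_wunsch(X, Y):
     """Full-matrix Needleman-Wunsch with traceback; used only for tiny base cases."""
     n, m = len(X), len(Y)
     score = [[0] * (m + 1) for _ in range(n + 1)]
     for i in range(1, n + 1):
         score[i][0] = score[i - 1][0] + del_cost(X[i - 1])
     for j in range(1, m + 1):
         score[0][j] = score[0][j - 1] + ins_cost(Y[j - 1])
     for i in range(1, n + 1):
         for j in range(1, m + 1):
             score[i][j] = max(score[i - 1][j - 1] + sub_cost(X[i - 1], Y[j - 1]),
                               score[i - 1][j] + del_cost(X[i - 1]),
                               score[i][j - 1] + ins_cost(Y[j - 1]))
     Z, W, i, j = "", "", n, m        # trace back one optimal path
     while i > 0 or j > 0:
         if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub_cost(X[i - 1], Y[j - 1]):
             Z, W, i, j = X[i - 1] + Z, Y[j - 1] + W, i - 1, j - 1
         elif i > 0 and score[i][j] == score[i - 1][j] + del_cost(X[i - 1]):
             Z, W, i = X[i - 1] + Z, '-' + W, i - 1
         else:
             Z, W, j = '-' + Z, Y[j - 1] + W, j - 1
     return Z, W
 
 def hirschberg(X, Y):
     """Return one optimal global alignment (Z, W) of X and Y in linear space."""
     if len(X) == 0:
         return '-' * len(Y), Y
     if len(Y) == 0:
         return X, '-' * len(X)
     if len(X) == 1 or len(Y) == 1:
         return needleman_wunsch(X, Y)
     xmid = len(X) // 2
     score_l = nw_score(X[:xmid], Y)               # forward scores for the left half of X
     score_r = nw_score(X[xmid:][::-1], Y[::-1])   # backward scores for the right half of X
     ymid = partition_y(score_l, score_r)
     z_left, w_left = hirschberg(X[:xmid], Y[:ymid])
     z_right, w_right = hirschberg(X[xmid:], Y[ymid:])
     return z_left + z_right, w_left + w_right

On the example in the next section, this sketch yields the pair <code>("AGTACGCA", "--TATGC-")</code>, i.e. the alignment shown there.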
== Example ==

Let

<math>
\begin{align}
X &= \mathrm{AGTACGCA},\\
Y &= \mathrm{TATGC},\\
\operatorname{Del}(x) &= -2,\\
\operatorname{Ins}(y) &= -2,\\
\operatorname{Sub}(x,y) &= \begin{cases} +2, & \mbox{if } x = y \\ -1, & \mbox{if } x \neq y.\end{cases}
\end{align}
</math>

The optimal alignment is given by

 W = AGTACGCA
 Z = --TATGC-

Indeed, this can be verified by backtracking its corresponding Needleman–Wunsch matrix:
          '''T   A   T   G   C'''
     '''0'''  -2  -4  -6  -8 -10
 '''A'''  '''-2'''  -1   0  -2  -4  -6
 '''G'''  '''-4'''  -3  -2  -1   0  -2
 '''T'''  -6  '''-2'''  -4   0  -2  -1
 '''A'''  -8  -4   '''0'''  -2  -1  -3
 '''C''' -10  -6  -2  '''-1'''  -3   1
 '''G''' -12  -8  -4  -3   '''1'''  -1
 '''C''' -14 -10  -6  -5  -1   '''3'''
 '''A''' -16 -12  -8  -7  -3   '''1'''
One starts with the top-level call to <math>\operatorname{Hirschberg}(\mathrm{AGTACGCA}, \mathrm{TATGC})</math>. The call to <math>\operatorname{NWScore}(\mathrm{AGTA},Y)</math> produces the following matrix:
          '''T   A   T   G   C'''
     0  -2  -4  -6  -8 -10
 '''A'''  -2  -1   0  -2  -4  -6
 '''G'''  -4  -3  -2  -1   0  -2
 '''T'''  -6  -2  -4   0  -2  -1
 '''A'''  -8  -4   0  -2  -1  -3
Likewise, <math>\operatorname{NWScore}(\operatorname{rev}(\mathrm{CGCA}), \operatorname{rev}(Y))</math> generates the following matrix:
          '''C   G   T   A   T'''
     0  -2  -4  -6  -8 -10
 '''A'''  -2  -1  -3  -5  -4  -6
 '''C'''  -4   0  -2  -4  -6  -5
 '''G'''  -6  -2   2   0  -2  -4
 '''C'''  -8  -4   0   1  -1  -3
Their last lines are respectively
 ScoreL = [ -8 -4 0 -2 -1 -3 ]
 ScoreR = [ -8 -4 0 1 -1 -3 ]
<tt>PartitionY(ScoreL, ScoreR) = 2</tt>, such that <math>X = \mathrm{AGTA} + \mathrm{CGCA}</math> and <math>Y = \mathrm{TA} + \mathrm{TGC}</math>.
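Assuming the Python sketches from the algorithm description, these intermediate values can be reproduced directly:

 score_l = nw_score("AGTA", "TATGC")              # [-8, -4, 0, -2, -1, -3]
 score_r = nw_score("CGCA"[::-1], "TATGC"[::-1])  # [-8, -4, 0, 1, -1, -3]
 ymid = partition_y(score_l, score_r)             # 2, i.e. Y splits into "TA" + "TGC"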
The entire Hirschberg recursion (which we omit for brevity) produces the following tree:
                (AGTACGCA,TATGC)
                /              \
         (AGTA,TA)            (CGCA,TGC)
         /      \             /        \
     (AG,)   (TA,TA)      (CG,TG)     (CA,C)
             /     \       /    \
         (T,T)   (A,A)  (C,T)  (G,G)
The leaves of the tree contain the optimal alignment.
==See also==
* [[Needleman–Wunsch algorithm]]
* [[Smith–Waterman algorithm]]
* [[Levenshtein distance]]
* [[Longest common subsequence problem|Longest common subsequence]]
==References==
{{reflist}}

{{DEFAULTSORT:Hirschberg's Algorithm}}
[[Category:Sequence alignment algorithms]]
[[Category:Bioinformatics algorithms]]
[[Category:Articles with example pseudocode]]
[[Category:Dynamic programming]]