# ClassW04ApproxAlgs/ZhengLiu


### Approximation algorithms for NMR spectral peak assignment

Problem Origin:

Nuclear magnetic resonance (NMR) spectroscopy is widely used to determine protein structure. There are three steps to determine a protein structure by NMR. The data generated by NMR are called resonance peaks. A spin system is a group of resonance peaks corresponding to the same amino acid, the basic unit of a protein sequence. One interesting computational problem is NMR spectral peak assignment, i.e. how to relate spin systems to amino acids.

NMR spectral peak assignment can be viewed as a Constrained Bipartite Matching (CBM) problem. We view each spin system as a vertex in set V and each amino acid as a vertex in set U. Edges between U and V represent potential assignments. The edges may be weighted to express the preference for a particular assignment. The constraint in CBM comes from a biological fact: spin systems obtained from one NMR experiment are known to belong to consecutive amino acids. The goal is to find a MAXIMUM MATCHING among all feasible matchings. In a feasible matching, vertices in V corresponding to spin systems from the same NMR experiment must be matched to vertices in U corresponding to consecutive amino acids.

Problem Definition:

Let's consider a simplified version of the NMR spectral peak assignment problem. We are given a bipartite graph G = (U, V, E). The vertices in V are of 2 types. Any edge incident to a type-1 vertex can be chosen into a matching. Type-2 vertices are adjacently located in V, e.g. vj and vj+1, and the edge (ui, vj) can be chosen into a matching iff (ui+1, vj+1) is chosen into the matching as well. We call such a group of type-2 vertices a string, since its edges must be selected as a whole. The maximum length of a string is denoted by B, and the problem is called B-string constrained bipartite matching. The traditional maximum matching problem is the unweighted 1-string CBM problem, which can be solved in polynomial time. But when B is 2 or more, the problem becomes hard to tackle. From now on, we only consider the unweighted 2-string CBM problem, in which the maximum length of a type-2 string is 2. Fig1 is an instance of the unweighted 2-string CBM problem.

Another way to understand the unweighted 2-string CBM problem is to treat it as an interval scheduling problem, in which each vertex in U is viewed as a time slot and a vertex in V is viewed as a job. More precisely, type-1 vertices in V are jobs that need 1 time slot, whereas a type-2 vertex pair (2-string pair) is a job that needs 2 consecutive time slots. The goal is to maximize the number of jobs executed without conflicts, where a type-1 job counts as 1 and a type-2 job counts as 2.
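The scheduling view can be made concrete with a tiny exhaustive solver. This is only a minimal sketch for very small instances, not an algorithm from the cited papers: the function name `max_scheduled` and the job encoding (a length of 1 or 2 plus a set of allowed start slots, with slots numbered from 0) are assumptions made for illustration. A length-2 job starting at slot i stands for the edge pair (ui, vj), (ui+1, vj+1).

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical encoding: (length, allowed start slots), slots numbered from 0.
Job = Tuple[int, Set[int]]

def max_scheduled(jobs: List[Job], num_slots: int) -> int:
    """Exhaustively find the maximum total length of jobs placed without overlap."""
    def solve(k: int, used: FrozenSet[int]) -> int:
        if k == len(jobs):
            return 0
        length, starts = jobs[k]
        best = solve(k + 1, used)  # option: leave job k unscheduled
        for s in starts:
            slots = set(range(s, s + length))
            if s + length <= num_slots and not (slots & used):
                best = max(best, length + solve(k + 1, used | slots))
        return best
    return solve(0, frozenset())

# One 2-slot job and two 1-slot jobs on 4 time slots.
example_jobs = [(2, {0, 1}), (1, {1}), (1, {0, 3})]
print(max_scheduled(example_jobs, 4))  # 3
```

In the example the 2-slot job always occupies slot 1, so the 1-slot job that can only run in slot 1 is blocked and the best total is 3. The running time is exponential, so this is useful only for checking small cases by hand.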

Hardness of Constrained Bipartite Matching:

• MAX SNP-HARD Class and L-reduction
In 1991, Papadimitriou and Yannakakis introduced the class MAX SNP to characterize the hardness of approximating optimization problems. Every problem in MAX SNP has a constant-factor approximation algorithm, and a problem that is MAX SNP-hard has no PTAS (Polynomial Time Approximation Scheme) unless P = NP. They also introduced the L-reduction (linear reduction), which preserves approximability.

• The L-reduction is defined as follows:

Given two optimization problems A and B, A is L-reducible to B if there are constants α, β > 0 and two polynomial-time algorithms f and g such that: for any instance a of A, f generates an instance b of B with OPT(b) ≤ α × OPT(a); and for any feasible solution cb of b with cost(cb), g generates a feasible solution ca of a with cost(ca) such that the error of ca is at most a constant multiple of the error of cb, i.e. |cost(ca) - OPT(a)| ≤ β × |cost(cb) - OPT(b)|.
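To see why such a reduction transfers a PTAS, here is a short derivation for maximization problems. It assumes the standard form of the definition with two constants α, β > 0, where OPT(b) ≤ α × OPT(a) and the error of ca is at most β times the error of cb, and it assumes that cb is a (1 - ε)-approximate solution of b:

```latex
\begin{align*}
\mathrm{OPT}(a) - \mathrm{cost}(c_a)
  &\le \beta \bigl(\mathrm{OPT}(b) - \mathrm{cost}(c_b)\bigr) \\
  &\le \beta\,\varepsilon\,\mathrm{OPT}(b) \\
  &\le \alpha\beta\,\varepsilon\,\mathrm{OPT}(a)
\end{align*}
```

Hence cost(ca) ≥ (1 - αβε) × OPT(a), so choosing ε small enough for B gives any desired accuracy for A.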

To prove the MAX SNP-hardness of a problem, one needs to construct an L-reduction from a known MAX SNP-complete problem. Here we construct the L-reduction from MAXIMUM BOUNDED 3-DIMENSIONAL MATCHING (MB3DM), which is MAX SNP-complete. MB3DM is defined as follows: given a universal set H = {1, 2, 3, ..., m} and subsets S1, S2, ..., Sn of H, where each subset Si contains exactly 3 elements of H and each element of H appears in at most 3 subsets, find the maximum number of pairwise disjoint subsets.

• L-reduction from an instance of MB3DM to an instance of CBM:

For a given instance of MB3DM, assume m = 3q and n ≥ q. We construct an instance of 2-string CBM, i.e. a bipartite graph G = (U, V, E). The solutions of the two instances must be related in such a way that if CBM has a PTAS, so does MB3DM. The construction is as follows.
To construct an instance of CBM, one needs to define U, V and E of the bipartite graph G.
1. Set U consists of 7n vertices: each subset Si in the MB3DM instance corresponds to 7 vertices ai1, ai2, ai3, ai4, ai5, ai6, ai7 in U.
2. Set V consists of 3 different kinds of vertices.
   1. Construct q vertices f1, f2, ..., fq in V. We call them f-type vertices.
   2. For each element i in the universal set H of the MB3DM instance, construct 2 vertices bi1, bi2 in V. We call them b-type vertices.
   3. For each subset Si, construct 6 vertices ci1, ci2, ci3, ci4, ci5, ci6 in V. We call them c-type vertices.
3. Set E is defined according to the type of the vertex in V.
   1. Connect each f-type vertex fi with aj1 for j = 1 .. n, i.e. with the first vertex of every 7-vertex group in U.
   2. For each b-type vertex pair bi1bi2, corresponding to element i of H, and each subset Sj that contains i, connect the pair with aj2aj3, aj4aj5 or aj6aj7, depending on whether i is the first, second or third element of Sj.
   3. Connect each c-type vertex group ci1, ci2, ci3, ci4, ci5, ci6 with aj1, aj2, aj3, aj4, aj5, aj6 respectively for j = 1 .. n, i.e. each group of 6 c-type vertices has edges to the first 6 vertices of every 7-vertex group in U.

The above construction gives an instance of CBM with |U| = 7n and |V| = q + 2 × m + 6n = 7q + 6n. From the definition of E, the b-type and c-type vertices in V are type-2 vertices, which carry the constraint, whereas the f-type vertices are type-1 vertices. Here is an example of the L-reduction. The instance of MB3DM is given as H = {1,2,3,4}, S1 = {1,2,3} and S2 = {2,3,4}. The corresponding instance of CBM is shown in Fig2:
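The construction can be sketched in a few lines of Python. This is an illustrative sketch, not code from the cited papers: the function name `build_cbm` and the tuple labels such as `("a", j, k)` for ajk are invented here. It assumes m is a multiple of 3, that each subset is given as an ordered triple (so "first, second, third element" is well defined), and it follows the literal reading that every c-type group is connected to every 7-vertex group.

```python
from typing import List, Set, Tuple

Vertex = Tuple  # e.g. ("a", j, k), ("f", i), ("b", i, k), ("c", j, k)

def build_cbm(m: int, subsets: List[Tuple[int, int, int]]):
    """Build (U, V, E) of the CBM instance from an MB3DM instance on H = {1..m}."""
    n = len(subsets)
    q = m // 3  # assumption: m = 3q
    U = [("a", j, k) for j in range(1, n + 1) for k in range(1, 8)]
    f = [("f", i) for i in range(1, q + 1)]                           # type-1
    b = [("b", i, k) for i in range(1, m + 1) for k in (1, 2)]        # type-2 pairs
    c = [("c", j, k) for j in range(1, n + 1) for k in range(1, 7)]   # type-2 pairs
    V = f + b + c
    E: Set[Tuple[Vertex, Vertex]] = set()
    # f-type: edge to the first vertex of every 7-vertex group.
    for fv in f:
        for j in range(1, n + 1):
            E.add((("a", j, 1), fv))
    # b-type: the pair of element i goes to a_j2 a_j3, a_j4 a_j5 or a_j6 a_j7,
    # depending on the position of i inside S_j.
    for j, subset in enumerate(subsets, start=1):
        for pos, i in enumerate(subset):
            E.add((("a", j, 2 + 2 * pos), ("b", i, 1)))
            E.add((("a", j, 3 + 2 * pos), ("b", i, 2)))
    # c-type: group i goes to the first 6 vertices of every group j.
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, 7):
                E.add((("a", j, k), ("c", i, k)))
    return U, V, E

U, V, E = build_cbm(6, [(1, 2, 3), (4, 5, 6), (1, 4, 5)])
print(len(U), len(V))  # 21 32, i.e. 7n and 7q + 6n for n = 3, q = 2
```

Representing vertices as labeled tuples keeps the correspondence with the names ajk, fi, bik and cik visible, which makes it easy to check the sizes |U| = 7n and |V| = 7q + 6n on small inputs.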

• Why does the reduction work?
To prove the correctness of the L-reduction, let's first analyze the relationship between a feasible solution to MB3DM and a feasible solution to the CBM problem.

Assume a feasible solution to MB3DM has size p (p ≤ q), which means there are p subsets that are pairwise disjoint. Then the corresponding CBM instance has a feasible solution of size 7p + 6(n-p) = 6n + p. Here is the explanation:

To maximize the size of the matching, one wants to match as many vertices as possible in each 7-vertex group of U. For every subset in the disjoint collection, match all 7 of its vertices: the first with an f-type vertex in V and the other 6 with the 3 b-type vertex pairs of the elements it contains. For each remaining subset, simply use c-type vertices to match 6 of its 7 vertices in U. A remaining subset cannot reach 7 matches, because the only way to get 7 is to use 1 f-type vertex together with 6 b-type vertices, and the subset may share an element with another subset whose b-type vertices are already used. Also, f-type and c-type vertices conflict with each other, as both use the first vertex of a 7-vertex group of U.

For example, in Fig2: pick S1 as the feasible solution to MB3DM. In set U, a11, a12, a13, a14, a15, a16, a17 can be matched with f1, b11, b12, b21, b22, b31, b32 respectively. As for S2, since b21, b22, b31, b32 have already been used by S1, only a21, a22, a23, a24, a25, a26 of its vertices a21, ..., a27 can be matched, with c21, c22, c23, c24, c25, c26.

Now we need to prove that if CBM has a PTAS, so does MB3DM. Proof:
Denote the OPT of MB3DM as p*; then the OPT of CBM is 6n + p*. Assume CBM has a PTAS, i.e. for any ε > 0 we can find in polynomial time a feasible CBM solution of cost 6n + p, corresponding to a feasible MB3DM solution of cost p, for which the following inequality holds. Both problems are maximization problems, so the guarantee is a lower bound:
6n + p ≥ (1 - ε)(6n + p*)
p ≥ (1 - ε)p* - 6nε

Recall that in an MB3DM instance each element appears in no more than 3 subsets. Thus, for a given subset, at most 6 other subsets can conflict with it, since each of its 3 elements appears in at most 2 other subsets. So we have:
p* ≥ n/7, because repeatedly picking a subset and discarding the at most 6 subsets that overlap it keeps at least one subset out of every seven. Combining n ≤ 7p* with the previous inequality, we get:
p ≥ (1 - ε)p* - 42εp* = (1 - 43ε)p*. Replacing 43ε with ε' gives a PTAS for the MB3DM problem.

The 5/3 approximation algorithm
Since no PTAS exists for the 2-string CBM problem unless P = NP, we aim for a constant-factor approximation algorithm. Here we introduce the 5/3-approximation algorithm, which guarantees cost(AppAlg) ≥ 3/5 × OPT. To help in understanding the algorithm, Fig1 is used as an example CBM instance. Fig3 shows the optimal matching of Fig1.

Denote the size of the constrained bipartite matching found by an algorithm A as cost(A), a maximum constrained bipartite matching as M*, the number of type-1 edges in M* as m1*, and the number of type-2 vertex pairs (2-strings) matched in M* as m2*. Obviously |M*| = m1* + 2m2*.

• Algorithm A with cost(A) ≥ m1* + m2*
1. Modify the CBM instance G = (U, V, E) to construct a new bipartite graph G' = (U, V, E'), where E' is obtained from E by removing all edges incident to the second vertex of each 2-string vertex pair. Find a maximum matching M' of G'. We have |M'| ≥ m1* + m2*, because keeping only the first edge of every 2-string edge pair in M* gives a matching of G' of that size.
2. Obtain a feasible matching M of G from M' by the following steps:
   1. Expand M' into TEMP_M: copy edge e into TEMP_M if e is incident to a type-1 vertex; if e = (u_i, v_j) is incident to the first vertex of a 2-string vertex pair v_j, v_{j+1}, expand it into the two edges e = (u_i, v_j) and e' = (u_{i+1}, v_{j+1}).
   2. Copy into M each edge e in TEMP_M that conflicts with no other edge in TEMP_M.
   3. For each maximal group of edges that conflict with each other in TEMP_M, take the corresponding edges in M', say (u_i, v_{j_1}), (u_{i+1}, v_{j_2}), ..., (u_{i+h-1}, v_{j_h}). Clearly (u_i, v_{j_1}) is the first of the two edges incident to a 2-string vertex pair. There are three cases for adding edges to M:
      1. If h is even, add (u_i, v_{j_1})(u_{i+1}, v_{j_1+1}), (u_{i+2}, v_{j_3})(u_{i+3}, v_{j_3+1}), ..., (u_{i+h-2}, v_{j_{h-1}})(u_{i+h-1}, v_{j_{h-1}+1}) to M.
      2. If h is odd and v_{j_h} is a type-1 vertex in V, add (u_i, v_{j_1})(u_{i+1}, v_{j_1+1}), (u_{i+2}, v_{j_3})(u_{i+3}, v_{j_3+1}), ..., (u_{i+h-3}, v_{j_{h-2}})(u_{i+h-2}, v_{j_{h-2}+1}) and (u_{i+h-1}, v_{j_h}) to M.
      3. If h is odd and v_{j_h} is the first vertex of a type-2 vertex pair in V, add (u_i, v_{j_1})(u_{i+1}, v_{j_1+1}), (u_{i+2}, v_{j_3})(u_{i+3}, v_{j_3+1}), ..., (u_{i+h-1}, v_{j_h})(u_{i+h}, v_{j_h+1}) to M.

Fig 4 shows modified graph G', matching M' of G' and M of G.

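Analysis: Every group of h conflicting edges of M' contributes at least h edges to M (h in the first two cases, h+1 in the third), and every non-conflicting edge of M' contributes at least one edge. Hence |M| ≥ |M'| ≥ m1* + m2*.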
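Algorithm A can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `algorithm_a` and the input encoding are made up for this page, slots are numbered from 0, and the maximum matching is found with a plain augmenting-path search. Two assumptions are worth stating. First, G' keeps a first-vertex edge (u_i, v_j) only when the partner edge (u_{i+1}, v_{j+1}) exists in E, so that every kept edge can be expanded. Second, the three cases of conflict resolution are implemented as a single left-to-right scan over the slots, which is intended to select the same edges as the case analysis above.

```python
from typing import Dict, Hashable, List, Set, Tuple

Edge = Tuple[int, Hashable]  # (slot index in U, vertex name in V)

def algorithm_a(num_slots: int, type1: List[Hashable],
                pairs: List[Tuple[Hashable, Hashable]],
                edges: Set[Edge]) -> Set[Edge]:
    """Return a feasible constrained matching built from a maximum matching of G'."""
    second_of = dict(pairs)  # first vertex of a 2-string -> its second vertex
    # Step 1: build G' (adjacency of type-1 vertices and of first vertices only).
    adj: Dict[Hashable, List[int]] = {}
    for v in type1:
        adj[v] = [u for u in range(num_slots) if (u, v) in edges]
    for first, second in pairs:
        adj[first] = [u for u in range(num_slots - 1)
                      if (u, first) in edges and (u + 1, second) in edges]
    # Maximum matching M' of G' by augmenting paths.
    slot_to_v: Dict[int, Hashable] = {}
    def augment(v: Hashable, seen: Set[int]) -> bool:
        for u in adj[v]:
            if u in seen:
                continue
            seen.add(u)
            if u not in slot_to_v or augment(slot_to_v[u], seen):
                slot_to_v[u] = v
                return True
        return False
    for v in adj:
        augment(v, set())
    # Step 2: scan the slots; an expanded 2-string overrides the M' edge
    # of the following slot, which is how conflicts are resolved.
    matching: Set[Edge] = set()
    u = 0
    while u < num_slots:
        v = slot_to_v.get(u)
        if v is None:
            u += 1
        elif v in second_of:
            matching.add((u, v))
            matching.add((u + 1, second_of[v]))
            u += 2
        else:
            matching.add((u, v))
            u += 1
    return matching

# Two 2-strings competing for slot 1, plus one type-1 vertex "x".
E = {(0, "p1"), (1, "p2"), (1, "r1"), (2, "r2"), (1, "x"), (3, "x")}
print(sorted(algorithm_a(4, ["x"], [("p1", "p2"), ("r1", "r2")], E)))
```

In this small example M' matches slots 0, 1 and 3, the two 2-strings form a conflicting group with h = 2, and the scan keeps the first 2-string together with the type-1 edge at slot 3.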

• Algorithm B with cost(B) ≥ m1*/3 + 4m2*/3
1. For the given CBM instance G = (U, V, E), construct 3 different edge-weighted bipartite graphs G0, G1, G2, with Gi = (Ui, Vi, Ei), as follows:
   1. Vi is defined as: starting from V, merge every type-2 vertex pair v_j v_{j+1} into a single super-vertex s_{j,j+1}.
   2. Ui is defined as: starting from U, regroup consecutive vertices into triples, renaming the triple u_j, u_{j+1}, u_{j+2} as the super-vertex t_{j,j+1,j+2}. The three graphs differ in the vertex from which the regrouping starts. If neither u_1 nor u_2 is grouped, group them as t_{1,2}. If neither u_{n-1} nor u_n is grouped, group them as t_{n-1,n}.
   3. Ei is defined as:
      1. For a type-1 vertex v_h in Vi: if there is an edge between v_h and u_{j+1} in G, add a weight-1 edge between v_h and t_{j,j+1,j+2} to Ei, provided the super-vertex t_{j,j+1,j+2} is in Ui. If there is an edge between v_h and u_1 (or between v_h and u_n), add a weight-1 edge between v_h and t_{1,2} (or t_{n-1,n}) to Ei, provided that super-vertex is in Ui.
      2. For a type-2 super-vertex s_{h,h+1} in Vi: add a weight-2 edge between s_{h,h+1} and t_{j,j+1,j+2} to Ei if the original graph G has an edge pair either between v_h v_{h+1} and u_j u_{j+1} or between v_h v_{h+1} and u_{j+1} u_{j+2}; add a weight-2 edge between s_{h,h+1} and t_{1,2} (or t_{n-1,n}) to Ei if G has an edge pair between v_h v_{h+1} and u_1 u_2 (or u_{n-1} u_n).
2. For each graph Gi, find a maximum-weight matching Mi of Gi.
3. Expand each Mi into a feasible matching of G, also denoted Mi, by reversing the construction steps.
4. Take the largest of M0, M1 and M2 as M, the feasible solution of G.

Fig 5 shows modified graph G0, G1, G2, maximum weighted matching Mi of Gi and constrained matching Mi of G.

Analysis
We need to show that |M| ≥ m1*/3 + 4m2*/3. Denote M* as the OPT of CBM. For each Gi, define a matching Mi* of Gi starting from M* as follows:

1. For an edge of M* incident to a type-1 vertex in G, say (u_j, v_h):
   1. if Gi contains the vertex t_{j-1,j,j+1}, add a weight-1 edge between t_{j-1,j,j+1} and v_h to Mi*.
   2. if Gi contains the vertex t_{j-1,j} or t_{j,j+1} or u_j, add a weight-1 edge between that vertex and v_h to Mi*.
2. For a pair of edges of M* incident to a type-2 vertex pair in G, say (u_j, v_h), (u_{j+1}, v_{h+1}), add a weight-2 edge to Mi* if u_j and u_{j+1} belong to the same super-vertex in Gi.

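Each edge in M* that is incident to a type-1 vertex is represented in exactly 1 of M0*, M1*, M2*, and each edge pair in M* that is incident to a type-2 vertex pair is represented in exactly 2 of M0*, M1*, M2*. The total weight of M0*, M1* and M2* is therefore m1* + 2 × 2m2* = m1* + 4m2*, and the heaviest of the three has weight at least the average: Max(M0*, M1*, M2*) ≥ m1*/3 + 4m2*/3.

Each Mi* is a matching in Gi, and Mi is a maximum-weight matching in Gi, so the weight of Mi is at least the weight of Mi*. Expanding Mi back into a matching of G preserves this weight as the number of edges. Hence the following inequality holds:

|M| ≥ Max(M0*, M1*, M2*) ≥ m1*/3 + 4m2*/3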
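The counting argument behind the bound can be checked on the slot groupings alone. The sketch below is an assumption-laden illustration, not the authors' construction verbatim: it supposes that the three groupings are obtained by choosing as centers the slots j with j mod 3 equal to 0, 1 or 2, that a center at u_1 or u_n gives the truncated group t_{1,2} or t_{n-1,n}, and that a leftover single slot receives no type-1 edge. Which residue corresponds to which of the three graphs is arbitrary here, and the names `group_slots`, `times_designated` and `times_together` are hypothetical.

```python
from typing import List, Optional, Tuple

# (member slots, slot that receives type-1 edges, or None for a leftover single)
Group = Tuple[Tuple[int, ...], Optional[int]]

def group_slots(n: int, r: int) -> List[Group]:
    """Group slots 1..n around the centers j with j % 3 == r."""
    groups: List[Group] = []
    covered = set()
    for j in range(1, n + 1):
        if j % 3 == r:
            members = tuple(k for k in (j - 1, j, j + 1) if 1 <= k <= n)
            groups.append((members, j))
            covered.update(members)
    for k in range(1, n + 1):
        if k not in covered:
            groups.append(((k,), None))  # leftover single slot
    return groups

def times_designated(n: int, j: int) -> int:
    """In how many of the three groupings a type-1 edge at slot j survives."""
    return sum(any(c == j for _, c in group_slots(n, r)) for r in range(3))

def times_together(n: int, j: int) -> int:
    """In how many of the three groupings slots j and j+1 share a super-vertex."""
    return sum(any(j in m and j + 1 in m for m, _ in group_slots(n, r))
               for r in range(3))

n = 8
print([times_designated(n, j) for j in range(1, n + 1)])  # all 1
print([times_together(n, j) for j in range(1, n)])        # all 2
```

Under these assumptions every slot is a center in exactly one grouping and every pair of consecutive slots is split by exactly one grouping, which is where the weights 1/3 and 4/3 in the bound come from.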

• The 5/3 approximation algorithm
Denote by C the algorithm that returns the larger of the outputs of A and B. The approximation guarantee of C is cost(C) ≥ 3/5 × OPT.
Proof:
Let a be a real number in [0,1], the weight given to algorithm A in a convex combination of the two bounds. We estimate the approximation ratio b of Max(A,B):
Max(A, B) ≥ a × cost(A) + (1-a) × cost(B) ≥ a × (m1* + m2*) + (1-a) × (m1*/3 + 4m2*/3) = (1/3 + 2a/3) × m1* + (4/3 - a/3) × m2*
Recall that we are looking for the approximation ratio b, i.e. the largest b such that Max(A,B) ≥ b × OPT, where OPT = m1* + 2m2*. This requires 1/3 + 2a/3 ≥ b and 4/3 - a/3 ≥ 2b.
Choosing a = 2/5 gives (1/3 + 2a/3) × m1* + (4/3 - a/3) × m2* = 3/5 × m1* + 6/5 × m2* = 3/5 × OPT, so b = 3/5. Thus the maximum of A and B achieves a 5/3 approximation.
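The arithmetic can be double-checked with exact fractions. The snippet below is only a sanity check of the two lower bounds stated above, using helper names (`guaranteed`, `opt`) invented for this page; it does not run the algorithms themselves.

```python
from fractions import Fraction

def guaranteed(m1: int, m2: int) -> Fraction:
    """Larger of the lower bounds for A (m1 + m2) and B (m1/3 + 4*m2/3)."""
    bound_a = Fraction(m1 + m2)
    bound_b = Fraction(m1 + 4 * m2, 3)
    return max(bound_a, bound_b)

def opt(m1: int, m2: int) -> int:
    """Size of the optimal matching, |M*| = m1 + 2*m2."""
    return m1 + 2 * m2

# Coefficients of m1* and m2* in the convex combination with weight a = 2/5.
a = Fraction(2, 5)
print(Fraction(1, 3) + 2 * a / 3, Fraction(4, 3) - a / 3)  # 3/5 6/5

# The bound holds on a grid of values and is tight when m2 = 2 * m1.
assert all(5 * guaranteed(m1, m2) >= 3 * opt(m1, m2)
           for m1 in range(30) for m2 in range(30))
print(guaranteed(1, 2), Fraction(3, 5) * opt(1, 2))  # 3 3
```

The tight case m1* = 1, m2* = 2 shows that the two lower bounds alone cannot give a guarantee better than 3/5.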

References:

Chen ZZ, Jiang T, Lin GH, et al. More reliable protein NMR peak assignment via improved 2-interval scheduling. Lecture Notes in Computer Science 2832: 580-592, 2003.
Chen ZZ, Jiang T, Lin GH, et al. Approximation algorithms for NMR spectral peak assignment. Theoretical Computer Science 299 (1-3): 211-229, April 18, 2003.
Chen ZZ, Jiang T, Lin GH, et al. Improved approximation algorithms for NMR spectral peak assignment. Lecture Notes in Computer Science 2452: 82-96, 2002.
Xu Y, Xu D, Kim D, et al. Automated assignment of backbone NMR peaks using constrained bipartite matching. Computing in Science & Engineering 4 (1): 50-62, Jan-Feb 2002.
Papadimitriou CH, Yannakakis M. Optimization, approximation, and complexity classes. Journal of Computer and System Sciences 43: 425-440, 1991.
