Problem Origin:

Nuclear magnetic resonance (NMR) spectroscopy is widely used to determine protein structure [4]. Determining a protein structure by NMR takes three steps. The data generated by NMR are called **resonance peaks**. A **spin system** is a group of resonance peaks that correspond to the same amino acid, the basic unit of a protein sequence. One interesting computational problem is NMR spectral peak assignment, i.e. how to relate spin systems to amino acids.

This NMR spectral peak assignment can be viewed as a *Constrained Bipartite Matching* (CBM) problem. We view each spin system as a vertex in set V and each amino acid as a vertex in set U. Edges between U and V represent potential assignments. The edges may be weighted to specify the preference for a particular assignment. The constraint of the CBM problem comes from a biological fact: spin systems from one NMR experiment are known to belong to consecutive amino acids. The goal is to find a maximum (or, with weights, maximum-weight) matching that satisfies this constraint.

Problem Definition:

Let's consider the simplified version of the NMR spectral peak assignment problem:
Given a bipartite graph G = (U, V, E), the vertices in V are of 2 types. Any edge incident to a type-1 vertex can be chosen into a matching on its own. Type-2 vertices are adjacent in V, e.g. v_{j} and v_{j+1}, and the edge (u_{i}, v_{j}) can be chosen into a matching iff (u_{i+1}, v_{j+1}) is chosen into the matching as well. We call a group of adjacent type-2 vertices a string, since their edges must be selected as a whole. The maximum length of a string is denoted by B, and the problem is called B-string constrained bipartite matching. The traditional maximum matching problem is the 1-string unweighted CBM problem, which can be solved in polynomial time. But when B is at least 2, the problem becomes hard, as the hardness section shows. From now on, we only consider the unweighted 2-string CBM problem, in which the maximum length of a type-2 string is 2. Fig1 is an instance of the 2-string unweighted CBM problem.

Another way to understand the unweighted 2-string CBM problem is to treat it as an *interval scheduling* problem [1], in which each vertex in U is viewed as a time slot and each vertex in V as a job. More precisely, type-1 vertices in V are jobs that need 1 time slot, whereas a type-2 vertex pair (2-string) is a job that needs 2 consecutive time slots. The goal is to maximize the number of jobs executed without conflicts, where a type-1 job counts as 1 and a type-2 job counts as 2.
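To make the constraint in the problem definition concrete, here is a minimal sketch of a feasibility check together with an exhaustive search for tiny instances. The representation is an assumption made for illustration: vertices are 0-based indices, `edges` is a set of `(u, v)` pairs, and `pair_first` (a hypothetical name) holds every index j such that v_{j}, v_{j+1} form a 2-string. The search is exponential and only meant for checking small examples by hand.

```python
from itertools import combinations


def is_feasible(matching, edges, pair_first):
    """Check a candidate matching against the 2-string constraint.

    matching, edges: sets of (u, v) index pairs.
    pair_first: set of indices j such that (v_j, v_{j+1}) is a type-2 pair.
    """
    us = [u for u, _ in matching]
    vs = [v for _, v in matching]
    if len(set(us)) != len(us) or len(set(vs)) != len(vs):
        return False  # some vertex is matched twice
    if not matching <= edges:
        return False  # uses an edge that is not in E
    for u, v in matching:
        if v in pair_first and (u + 1, v + 1) not in matching:
            return False  # first half of a pair without its second half
        if v - 1 in pair_first and (u - 1, v - 1) not in matching:
            return False  # second half of a pair without its first half
    return True


def brute_force_opt(edges, pair_first):
    """Try all edge subsets, largest first (exponential; tiny inputs only)."""
    edge_list = sorted(edges)
    for size in range(len(edge_list), -1, -1):
        for subset in combinations(edge_list, size):
            if is_feasible(set(subset), edges, pair_first):
                return set(subset)
    return set()


# Hypothetical instance: v_0, v_1 form a 2-string; v_2 and v_3 are type-1.
edges = {(0, 0), (1, 1), (1, 0), (2, 1), (1, 2), (2, 3)}
best = brute_force_opt(edges, pair_first={0})
```

In the scheduling view, the job (v_0, v_1) takes slots u_0 and u_1, and the unit job v_3 takes slot u_2.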

Hardness of Constrained Bipartite Matching:

- MAX SNP-HARD Class and L-reduction

- The L-reduction is defined as follows:

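Given two optimization problems A and B, A is L-reducible to B if there are two polynomial-time algorithms f and g and two constants α, β > 0 such that, for any instance a of A: (1) f generates an instance b = f(a) of B with OPT(b) ≤ α × OPT(a); and (2) given any feasible solution c of b, g produces a feasible solution s of a such that |OPT(a) - cost(s)| ≤ β × |OPT(b) - cost(c)| [5].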

**Theorem** 2-string unweighted CBM is MAX SNP-hard.

To prove the MAX SNP-hardness of a problem, one needs to construct an L-reduction from a known MAX SNP-complete problem. Here we construct the L-reduction from the MB3DM problem.

- L-reduction from an instance of MB3DM to an instance of CBM:

For a given instance of MB3DM, assume m = 3q and n ≥ q, where m is the number of elements in the universal set H and n is the number of subsets. We would like to construct an instance of 2-string CBM, i.e. generate a bipartite graph G = (U, V, E). The construction must relate the solutions of the two problems closely enough that, if the CBM problem has a PTAS, so does the MB3DM problem. Here are the construction details:

In order to construct an instance of CBM, one needs to define U, V and E of the bipartite graph G.

- Set U consists of 7n vertices, with each subset Si in the MB3DM instance corresponding to 7 vertices a_{i1}, a_{i2}, a_{i3}, a_{i4}, a_{i5}, a_{i6}, a_{i7} in set U of the CBM instance.
- Set V consists of 3 different kinds of vertices.
  - Construct q vertices f_{1}, f_{2}, ... f_{q} in set V. We call them f-type vertices in V.
  - For each element i in the universal set H of the MB3DM instance, construct 2 vertices b_{i1}, b_{i2} in the V set of the CBM instance. We call them b-type vertices in V.
  - For each subset Si, construct 6 vertices c_{i1}, c_{i2}, c_{i3}, c_{i4}, c_{i5}, c_{i6} in the V set of the CBM instance. We call them c-type vertices in V.
- Set E is defined according to the type of the V vertices.
  - For each f-type vertex f_{i} in V, connect it with a_{j1} for j = 1 .. n, i.e. connect it with the first vertex of every 7-vertex group, each group corresponding to a subset in MB3DM.
  - For each b-type vertex pair b_{i1}b_{i2}, corresponding to an element i of the universal set H of MB3DM, connect it with a_{j2}a_{j3}, a_{j4}a_{j5}, a_{j6}a_{j7} in vertex set U respectively if subset Sj contains element i in the MB3DM instance.
  - For each c-type vertex group c_{i1}, c_{i2}, c_{i3}, c_{i4}, c_{i5}, c_{i6}, connect its vertices with a_{j1}, a_{j2}, a_{j3}, a_{j4}, a_{j5}, a_{j6} respectively for j = 1 .. n, i.e. each group of 6 c-type vertices has edges to the first 6 vertices of every 7-vertex group in set U, with each group corresponding to a subset Si in MB3DM.

The above construction gives an instance of CBM with |U| = 7n and |V| = q + 2 × m + 6n = 7q + 6n. From the definition of E, the b-type and c-type vertices in V are type-2 vertices, which carry the string constraint, whereas the f-type vertices are type-1 vertices. Here is an example of the L-reduction. The instance of MB3DM is given by the universal set H = {1,2,3,4}, S1 = {1,2,3} and S2 = {2,3,4}. The corresponding instance of CBM is shown in Fig2.
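The construction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: vertices are labelled with tuples such as `("a", j, k)` for a_{jk}, the function name `build_cbm` is hypothetical, and the b-type edges are assigned by the position of the element inside its subset (first element to a_{j2}a_{j3}, second to a_{j4}a_{j5}, third to a_{j6}a_{j7}), which is how the worked example for S1 pairs them.

```python
def build_cbm(elements, subsets, q):
    """Build (U, V, E) from a set system with 3-element subsets.

    elements: element ids of the universal set H.
    subsets:  list of 3-tuples S_1..S_n over those elements.
    q:        number of f-type vertices.
    """
    n = len(subsets)
    U = [("a", j, k) for j in range(1, n + 1) for k in range(1, 8)]
    V = [("f", i) for i in range(1, q + 1)]
    V += [("b", e, k) for e in elements for k in (1, 2)]
    V += [("c", i, k) for i in range(1, n + 1) for k in range(1, 7)]

    E = set()
    # f-type: every f_i sees the first vertex of every 7-vertex group.
    for i in range(1, q + 1):
        for j in range(1, n + 1):
            E.add((("a", j, 1), ("f", i)))
    # b-type: assumption, the element at position pos of S_j uses
    # a_{j,2+2pos} and a_{j,3+2pos}.
    for j, subset in enumerate(subsets, start=1):
        for pos, e in enumerate(subset):
            E.add((("a", j, 2 + 2 * pos), ("b", e, 1)))
            E.add((("a", j, 3 + 2 * pos), ("b", e, 2)))
    # c-type: c_{ik} sees a_{jk} for k = 1..6 and, as the text states, all j.
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, 7):
                E.add((("a", j, k), ("c", i, k)))
    return U, V, E


# Hypothetical instance with m = 6 = 3q elements and n = 3 subsets.
U, V, E = build_cbm([1, 2, 3, 4, 5, 6], [(1, 2, 3), (4, 5, 6), (2, 3, 4)], q=2)
```

For this instance the sizes agree with the formulas: |U| = 7 × 3 = 21 and |V| = 7 × 2 + 6 × 3 = 32.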

- Why does the reduction work?

Assume a feasible solution to MB3DM has size p (p ≤ q), which means there are p pairwise disjoint subsets. Then the corresponding CBM instance has a feasible solution of size 7p + 6(n-p) = 6n + p. Here is the explanation:

To maximize the size of the matching, one prefers to match as many vertices as possible in each 7-vertex group of U. The idea is, for every subset in the disjoint collection, to match all 7 of its vertices: the first with an f-type vertex in V and the other 6 with the 3 b-type vertex pairs corresponding to the elements it contains. For each of the remaining subsets, simply choose c-type vertices so that 6 of its 7 vertices in U are matched. A remaining subset cannot reach 7 matches, because the only way to get 7 is to use 1 f-type vertex together with 6 b-type vertices, and the subset may share an element with another subset, in which case the shared b-type vertices are not available. Also, f-type and c-type vertices conflict with each other, as both use the first vertex of a 7-vertex group of U.

For example, in Fig2, suppose we pick S1 as the feasible solution to MB3DM. In set U, a_{11}, a_{12}, a_{13}, a_{14}, a_{15}, a_{16}, a_{17} can be matched with f_{1}, b_{11}, b_{12}, b_{21}, b_{22}, b_{31}, b_{32} respectively. As for S2, since b_{21}, b_{22}, b_{31}, b_{32} have already been used by S1, among its corresponding vertices a_{21}, a_{22}, a_{23}, a_{24}, a_{25}, a_{26}, a_{27} only a_{21}, a_{22}, a_{23}, a_{24}, a_{25}, a_{26} can be matched, with c_{21}, c_{22}, c_{23}, c_{24}, c_{25}, c_{26} respectively, as the c-type vertices are connected to the first 6 vertices of a group.

Now we need to prove that if CBM has a PTAS, so does MB3DM.
Proof:

Assume MB3DM has a feasible solution of size p, so that CBM has a feasible solution of size 6n + p. Denote the OPT of MB3DM as p^{*}; the OPT of CBM is then 6n + p^{*}.
Assume CBM has a PTAS. Since CBM is a maximization problem, this means that for any ε > 0 there is a feasible solution of size 6n + p for which the following inequality holds:

6n + p ≥ (1-ε)(6n + p^{*})

p ≥ (1-ε)p^{*} - 6nε

Recall that in an MB3DM instance, each element appears in no more than 3 subsets. Thus for a given subset, at most 6 other subsets can conflict with it, since each of its 3 elements appears in at most 2 other subsets. So we have:

p^{*} ≥ n/7, because repeatedly picking a subset and discarding the at most 6 subsets that overlap it yields at least n/7 pairwise disjoint subsets.
Combining n ≤ 7p^{*} with the previous inequality, we get:

p ≥ (1-ε)p^{*} - 42εp^{*} = (1-43ε)p^{*}. If we replace 43ε with ε^{'}, we get a PTAS for the MB3DM problem.

The 5/3 approximation algorithm

Since 2-string CBM is MAX SNP-hard, it has no PTAS unless P = NP, so we aim for a constant-factor approximation algorithm instead. Here we introduce the 5/3 approximation algorithm, which means that cost(AppAlg) ≥ (3/5) × OPT. To help understand the algorithm, Fig1 will be used as an example CBM instance. Fig3 shows the optimal matching of Fig1.

Denote the size of the constrained bipartite matching found by an algorithm A as cost(A), the maximum constrained bipartite matching as M*, m_{1}^{*} as the number of type-1 edges in M* and m_{2}^{*} as the number of type-2 vertex pairs (2-strings) matched in M*. Obviously |M*| = m_{1}^{*} + 2m_{2}^{*}.

- Algorithm A with cost(A) ≥ m_{1}^{*} + m_{2}^{*}

  - Modify the instance of CBM G = (U, V, E) to construct a new bipartite graph G' = (U, V, E'), where E' is obtained by removing all the edges incident to the second vertex of each 2-string vertex pair. Find a maximum matching M' of G'. We have |M'| ≥ m_{1}^{*} + m_{2}^{*}.
  - Obtain a feasible matching M of G from M' by the following steps:
    - Expand M' to TEMP_M by adding the edge incident to the second vertex of a 2-string vertex pair whenever the edge incident to the first one is in M', i.e. copy edge e into TEMP_M if e is incident to a type-1 vertex; expand e into two edges e = u_{i}v_{j} and e' = u_{i+1}v_{j+1} if e = u_{i}v_{j} is incident to the first vertex of a 2-string vertex pair v_{j}, v_{j+1}.
    - For each edge e in TEMP_M, if it conflicts with no other edge in TEMP_M, copy it into M.
    - For the edges that conflict with each other in TEMP_M, take the corresponding edges in M', say (u_{i},v_{j_{1}}), (u_{i+1},v_{j_{2}}), ..., (u_{i+h-1},v_{j_{h}}). It is obvious that (u_{i},v_{j_{1}}) is the first of the two edges associated with a 2-string vertex pair. There are three cases for adding edges into M:
      - If h is even, add (u_{i},v_{j_{1}})(u_{i+1},v_{j_{1}+1}), (u_{i+2},v_{j_{3}})(u_{i+3},v_{j_{3}+1}), ..., (u_{i+h-2},v_{j_{h-1}})(u_{i+h-1},v_{j_{h-1}+1}) to M.
      - If h is odd and v_{j_{h}} is a type-1 vertex in V: add (u_{i},v_{j_{1}})(u_{i+1},v_{j_{1}+1}), (u_{i+2},v_{j_{3}})(u_{i+3},v_{j_{3}+1}), ..., (u_{i+h-3},v_{j_{h-2}})(u_{i+h-2},v_{j_{h-2}+1}) and (u_{i+h-1},v_{j_{h}}) to M.
      - If h is odd and v_{j_{h}} is the first vertex of a type-2 vertex pair in V: add (u_{i},v_{j_{1}})(u_{i+1},v_{j_{1}+1}), (u_{i+2},v_{j_{3}})(u_{i+3},v_{j_{3}+1}), ..., (u_{i+h-1},v_{j_{h}})(u_{i+h},v_{j_{h}+1}) to M.

Fig 4 shows modified graph G', matching M' of G' and M of G.

Analysis:
Based on the construction, we have |M| ≥ |M'| ≥ m_{1}^{*} + m_{2}^{*}.
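The steps of Algorithm A can be sketched as follows. Several choices here are assumptions made for illustration: vertices are 0-based indices, `pair_first` (a hypothetical name) lists the first vertex of every 2-string, Kuhn's augmenting-path method is swapped in as the maximum-matching routine since the text does not name one, and a lead edge (u_{i}, v_{j}) is kept in G' only when its companion (u_{i+1}, v_{j+1}) exists in E. The conflict resolution is written as a left-to-right scan that keeps every other edge of a conflicting run, which yields the same edges as the three cases above.

```python
def max_matching(adj):
    """Kuhn's augmenting-path algorithm; adj maps each v to a list of u."""
    match_u = {}  # u -> v

    def try_assign(v, seen):
        for u in adj[v]:
            if u in seen:
                continue
            seen.add(u)
            # Take u if it is free, or if its current partner can move.
            if u not in match_u or try_assign(match_u[u], seen):
                match_u[u] = v
                return True
        return False

    for v in adj:
        try_assign(v, set())
    return match_u


def algorithm_a(edges, pair_first):
    """Return (M', M): the matching of G' and the feasible matching of G."""
    pair_second = {v + 1 for v in pair_first}
    adj = {}
    for u, v in sorted(edges):
        if v in pair_second:
            continue  # E' drops edges at the second vertex of a pair
        if v in pair_first and (u + 1, v + 1) not in edges:
            continue  # assumption: a lead edge needs its companion edge
        adj.setdefault(v, []).append(u)
    m_prime = max_matching(adj)

    matching, next_free = set(), 0
    for u in sorted(m_prime):
        v = m_prime[u]
        if u < next_free:
            continue  # u is taken by the second edge of the last kept pair
        matching.add((u, v))
        if v in pair_first:
            matching.add((u + 1, v + 1))  # expand the 2-string edge
            next_free = u + 2
        else:
            next_free = u + 1
    return m_prime, matching


# Hypothetical instance: v_0, v_1 form a 2-string; v_2 and v_3 are type-1.
edges = {(0, 0), (1, 1), (1, 0), (2, 1), (1, 2), (2, 3)}
m_prime, m = algorithm_a(edges, pair_first={0})
```

Every skipped edge of M' is preceded by a kept 2-string edge that contributes two edges to M, which is why the scan never returns fewer edges than M'.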

- Algorithm B with cost(B) ≥ m_{1}^{*}/3 + 4m_{2}^{*}/3

  - For the given CBM instance G = (U, V, E), construct 3 different edge-weighted bipartite graphs G_{1}, G_{2}, G_{3}, with G_{i} = (U_{i}, V_{i}, E_{i}), as follows:
    - V_{i} is defined as: in set V, merge every type-2 vertex pair v_{j}v_{j+1} into a single super-vertex s_{j,j+1}.
    - U_{i} is defined as: in set U, regroup consecutive vertices into tuples, with each tuple renamed t_{j,j+1,j+2} if the group consists of u_{j}, u_{j+1}, u_{j+2}. For G_{i}, the regrouping of U starts from u_{i}. If neither u_{1} nor u_{2} is grouped, group them as t_{1,2}. If neither u_{n-1} nor u_{n} is grouped, group them as t_{n-1,n}.
    - E_{i} is defined as:
      - For a type-1 vertex v_{h} in V_{i}: if there is an edge between v_{h} and u_{j+1}, add a 1-weight edge between v_{h} and t_{j,j+1,j+2} into E_{i} if super-vertex t_{j,j+1,j+2} is in U_{i}. If there is an edge between v_{h} and u_{1} or between v_{h} and u_{n}, add a 1-weight edge between v_{h} and t_{1,2} or t_{n-1,n} into E_{i} respectively, if super-vertex t_{1,2} or t_{n-1,n} is in U_{i}.
      - For a type-2 super-vertex s_{h,h+1} in V_{i}: add a 2-weight edge between s_{h,h+1} and t_{j,j+1,j+2} into E_{i} if there is an edge pair either between v_{h}v_{h+1} and u_{j}u_{j+1} or between v_{h}v_{h+1} and u_{j+1}u_{j+2} in the original graph G; add a 2-weight edge between s_{h,h+1} and t_{1,2} or t_{n-1,n} into E_{i} if there is an edge pair either between v_{h}v_{h+1} and u_{1}u_{2} or between v_{h}v_{h+1} and u_{n-1}u_{n} in the original graph G.
  - For each graph G_{i}, find a maximum-weight matching M_{i} of G_{i}.
  - Expand each M_{i} into a feasible matching of G by reversing the construction steps.
  - Take the largest of the three expanded matchings as M, the feasible solution of G.

Fig 5 shows modified graph G

Analysis

We need to show that |M| ≥ m_{1}^{*}/3 + 4m_{2}^{*}/3. Denote M* as the OPT of CBM. Define M_{i}^{*} as follows, starting from M^{*}, in each G_{i}:

- For an edge incident to a type-1 vertex in G, say (u_{j}, v_{h}):
  - if there is a vertex t_{j-1,j,j+1}, add a weight-1 edge (t_{j-1,j,j+1}, v_{h}) to M_{i}^{*}.
  - if there is a vertex t_{j-1,j} or t_{j,j+1} or u_{j}, add a weight-1 edge between v_{h} and t_{j-1,j} or t_{j,j+1} or u_{j} to M_{i}^{*} respectively.
- For a pair of edges incident to a type-2 vertex pair in G, say (u_{j},v_{h}),(u_{j+1},v_{h+1}), add a weight-2 edge to M_{i}^{*} if u_{j}, u_{j+1} belong to the same super-vertex in G_{i}.

Each edge in M* that is incident to a type-1 vertex belongs to exactly 1 of M_{1}^{*}, M_{2}^{*}, M_{3}^{*}, and each pair of edges in M* that is incident to a type-2 vertex pair belongs to exactly 2 of M_{1}^{*}, M_{2}^{*}, M_{3}^{*}. The total weight of the three matchings is therefore m_{1}^{*} + 2 × 2m_{2}^{*} = m_{1}^{*} + 4m_{2}^{*}, so Max(M_{1}^{*}, M_{2}^{*}, M_{3}^{*}) ≥ m_{1}^{*}/3 + 4m_{2}^{*}/3.
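The claim that a 2-string lands inside one super-vertex in exactly 2 of the 3 regroupings can be checked with a small sketch. The function names are hypothetical, indices are 1-based to match the text, and the handling of leftovers (one or two vertices before the first triple or after the last one form a shorter group) is one reading of the construction of U_{i}.

```python
def group_u(n, start):
    """Regroup u_1..u_n into triples, with the first triple starting at u_start.

    Assumption: vertices left over before the first triple or after the
    last one form a single shorter group.
    """
    assert n >= 3 and start in (1, 2, 3)
    groups = []
    if start > 1:
        groups.append(tuple(range(1, start)))  # (1,) or (1, 2)
    j = start
    while j + 2 <= n:
        groups.append((j, j + 1, j + 2))
        j += 3
    if j <= n:
        groups.append(tuple(range(j, n + 1)))  # one or two leftover vertices
    return groups


def together(groups, j):
    """True if u_j and u_{j+1} fall into the same group."""
    return any(j in g and j + 1 in g for g in groups)


n = 7
for start in (1, 2, 3):
    print(start, group_u(n, start))

# For every consecutive pair (u_j, u_{j+1}), count the regroupings that keep it whole.
counts = [sum(together(group_u(n, s), j) for s in (1, 2, 3)) for j in range(1, n)]
```

A pair is split only when a group boundary falls between u_{j} and u_{j+1}, and that happens for exactly one of the three starting offsets.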

A matching M_{i}' in G_{i} with the same weight can be obtained from M_{i}^{*} by reversing the modification. Since M_{i} is a maximum-weight matching in G_{i}, its weight is at least that of M_{i}', and the following inequality holds:

|M| ≥ Max(M_{i}') ≥ m_{1}^{*}/3 + 4m_{2}^{*}/3

- The 5/3 approximation algorithm

Proof:

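Let a be a real number in [0,1], which stands for the share of algorithm A. We try to estimate the approximation ratio b of Max(A,B).

Max(A, B) ≥ a × cost(A) + (1-a) × cost(B) ≥ a × (m_{1}^{*} + m_{2}^{*}) + (1-a) × (m_{1}^{*}/3 + 4m_{2}^{*}/3) = (1/3 + 2a/3) × m_{1}^{*} + (4/3 - a/3) × m_{2}^{*}

Recall that we are trying to get the approximation ratio b, which means trying to find b s.t. Max(A,B) ≥ b × OPT = b × (m_{1}^{*} + 2m_{2}^{*}):

(1/3 + 2a/3) × m_{1}^{*} + (4/3 - a/3) × m_{2}^{*} ≥ b × (m_{1}^{*} + 2m_{2}^{*})

This holds when 1/3 + 2a/3 ≥ b and 4/3 - a/3 ≥ 2b. Choosing a = 2/5 satisfies both with b = 3/5, so Max(A,B) ≥ (3/5) × OPT, which gives the 5/3 approximation.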
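As a sanity check on the weighting argument, the following sketch searches for the best share a numerically. It assumes only the two bounds stated earlier, cost(A) ≥ m_{1}^{*} + m_{2}^{*} and cost(B) ≥ m_{1}^{*}/3 + 4m_{2}^{*}/3, together with OPT = m_{1}^{*} + 2m_{2}^{*}; the function name `ratio` and the grid of candidate values are choices made for illustration.

```python
from fractions import Fraction


def ratio(a):
    """Largest b guaranteed by the convex combination with weight a on A."""
    coeff_m1 = Fraction(1, 3) + Fraction(2, 3) * a  # coefficient of m1*
    coeff_m2 = Fraction(4, 3) - Fraction(1, 3) * a  # coefficient of m2*
    # b must satisfy b <= coeff_m1 and 2b <= coeff_m2.
    return min(coeff_m1, coeff_m2 / 2)


# Scan a over a grid of exact fractions in [0, 1].
candidates = [Fraction(k, 100) for k in range(101)]
best_a = max(candidates, key=ratio)
print(best_a, ratio(best_a))  # prints: 2/5 3/5
```

The first coefficient grows with a while the second shrinks, so the guaranteed ratio peaks where the two constraints meet.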

Reference:

[1] Chen ZZ, Jiang T, Lin GH, et al. **More reliable protein NMR peak assignment via improved 2-interval scheduling** *LECTURE NOTES IN COMPUTER SCIENCE* 2832: 580-592 2003

[2] Chen ZZ, Jiang T, Lin GH, et al. **Approximation algorithms for NMR spectral peak assignment** *THEOR COMPUT SCI* 299 (1-3): 211-229 APR 18 2003

[3] Chen ZZ, Jiang T, Lin GH, et al. **Improved approximation algorithms for NMR spectral peak assignment** *LECT NOTES COMPUT SCI* 2452: 82-96 2002

[4] Xu Y, Xu D, Kim D, et al. **Automated assignment of backbone NMR peaks using constrained bipartite matching** *COMPUT SCI ENG* 4 (1): 50-62 JAN-FEB 2002

[5] Papadimitriou CH, Yannakakis M. **Optimization, Approximation, and Complexity Classes** *JOURNAL OF COMPUTER AND SYSTEM SCIENCES* 43: 425-440 1991