## neal young / publications

• 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
We consider the problem of political redistricting: given the locations of people in a geographical area (e.g. a US state), the goal is to decompose the area into subareas, called districts, so that the populations of the districts are as close as possible and the districts are "compact" and "contiguous," to use the terms referred to in most US state constitutions and/or US Supreme Court rulings. We study a method that outputs a solution in which each district is the intersection of a convex polygon with the geographical area. The average number of sides per polygon is less than six. The polygons tend to be quite compact. Every two districts differ in population by at most one (so we call the solution balanced).

In fact, the solution is a centroidal power diagram: each polygon has an associated center in R3 such that
• the projection of the center onto the plane z = 0 is the centroid of the locations of people assigned to the polygon, and
• for each person assigned to that polygon, the polygon center is closest among all centers. The polygons are convex because they are the intersections of 3D Voronoi cells with the plane.
The solution is, in a well-defined sense, a locally optimal solution to the problem of choosing centers in the plane and choosing an assignment of people to those 2-d centers so as to minimize the sum of squared distances subject to the assignment being balanced.

A practical problem with this approach is that, in real-world redistricting, exact locations of people are unknown. Instead, the input consists of polygons (census blocks) and associated populations. A real redistricting must not split census blocks. We therefore propose a second phase that perturbs the solution slightly so it does not split census blocks. In our experiments, the second phase achieves this while preserving perfect population balance. The district polygons are no longer convex at the fine scale because their boundaries must follow the boundaries of census blocks, but at a coarse scale they preserve the shape of the original polygons.
• Minimization of labeling effort for person re-identification in camera networks is an important problem as most of the existing popular methods are supervised and they require large amount of manual annotations, acquiring which is a tedious job. In this work, we focus on this labeling effort minimization problem and approach it as a subset selection task where the objective is to select an optimal subset of image-pairs for labeling without compromising performance. Towards this goal, our proposed scheme first represents any camera network (with k number of cameras) as an edge weighted complete k-partite graph where each vertex denotes a person and similarity scores between persons are used as edge-weights. Then in the second stage, our algorithm selects an optimal subset of pairs by solving a triangle free subgraph maximization problem on the k-partite graph. This sub-graph weight maximization problem is NP-hard (at least for k larger than 3) which means for large datasets the optimization problem becomes intractable. In order to make our framework scalable, we propose two polynomial time approximately-optimal algorithms. The first algorithm is a 1/2-approximation algorithm which runs in linear time in the number of edges. The second algorithm is a greedy algorithm with sub-quadratic (in number of edges) time-complexity. Experiments on three state-of-the-art datasets depict that the proposed approach requires on an average only 8-15% manually labeled pairs in order to achieve the performance when all the pairs are manually annotated.
• An introduction to greedy approximation algorithms for combinatorial optimization problems. Informal definition of greedy algorithms. Greedy Set Cover, and other generalizations and applications. Greedy algorithms for other problems. Lagrangean relaxation based methods. Local ratio, greedy dual based schemes and greedy rounding. Connected Dominating Sets and other applications. Priority Based Approaches.
• IEEE International Conference on Data Engineering (ICDE)
The paper considers the problem of summarising a collection of reviews for an item; the application considered is that of reviews for doctors. We consider the case where there are aspects about the item that are captured in the review, and are organized in an ontology in the form of a DAG. There is also sentiment about the aspects, so the reviews can be thought of as aspect/sentiment pairs. The goal is to find a subset of them to cover all the pairs, where coverage takes into account the ontology structure. We propose and analyse algorithms, and perform quantitative and qualitative experiments on real data.
• Data Mining and Knowledge Discovery 30(1):243-281(2016)
Over the past decade, time series clustering has become an increasingly important research topic in data mining community. Most existing methods for time series clustering rely on distances calculated from the entire raw data using the Euclidean distance or Dynamic Time Warping distance as the distance measure. However, the presence of significant noise, dropouts, or extraneous data can greatly limit the accuracy of clustering in this domain. Moreover, for most real world problems, we cannot expect objects from the same class to be equal in length. As a consequence, most work on time series clustering only considers the clustering of individual time series behaviors,'' e.g., individual heart beats or individual gait cycles, and contrives the time series in some way to make them all equal in length. However, automatically formatting the data in such a way is often a harder problem than the clustering itself. In this work, we show that by using only some local patterns and deliberately ignoring the rest of the data, we can mitigate the above problems and cluster time series of different lengths, e.g., cluster one heartbeat with multiple heartbeats. To achieve this, we exploit and extend a recently introduced concept in time series data mining called shapelets. Unlike existing work, our work demonstrates the unintuitive fact that shapelets can be learned from unlabeled time series. We show, with extensive empirical evaluation in diverse domains, that our method is more accurate than existing methods. Moreover, in addition to accurate clustering results, we show that our work also has the potential to give insight into the domains to which it is applied. While a brute-force algorithm to discover shapelets in an unsupervised way could be untenably slow, we introduce two novel optimization procedures to significantly speed up the unsupervised-shapelet discovery process and allow it to be cast as an anytime algorithm.
• work in progress(2016)
We empirically evaluate various stack-based LSM merge-compaction policies, including the default policy for Google Bigtable, Cassandra's size-tiered policy, HBase's exploring policy, and new policies proposed in "Online Bigtable merge compaction". We consider in depth the setting where, to keep read-response time low, the policy must manage compactions to keep the stack size below a specified parameter K. We evaluate the policies by implementing them within the AsterixDB NoSQL database. We also empirically partially validate the theoretical model for stack-based LSM merge-compaction proposed in "Online Bigtable merge compaction".
• A brief summary of greedy set-cover algorithms.

INDEX TERMS: dominating set, greedy algorithm, hitting set, set cover, minimizing a linear function subject to a submodular constraint.
• working paper(2016)
We describe the first nearly linear-time approximation algorithms for explicitly given mixed packing/covering linear programs, and for (non-metric) fractional facility location. We also describe the first parallel algorithms requiring only near-linear total work and finishing in polylog time. The algorithms compute $(1+\epsilon)$-approximate solutions in time (and work) $\tilde O(N/\epsilon^2)$, where $N$ is the number of non-zeros in the constraint matrix. For facility location, $N$ is the number of eligible client/facility pairs.
• working paper(2016)
NoSQL databases are widely used for massive data storage and real-time web applications. Yet important aspects of these data structures are not well understood. For example, NoSQL databases write most of their data to a collection of files on disk, meanwhile periodically compacting subsets of these files. A compaction policy must choose which files to compact, and when to compact them, without knowing the future workload. Although these choices can affect computational efficiency by orders of magnitude, existing literature lacks tools for designing and analyzing online compaction policies --- policies are now chosen largely by trial and error.

Here we introduce tools for the design and analysis of compaction policies for Google Bigtable, propose new policies, give average-case and worst-case competitive analyses, and present preliminary empirical benchmarks.
• A brief summary of online paging and caching results, 1985-2013.

INDEX TERMS: paging, caching, weighted caching, weighted paging, file caching, least recently used (paging algorithm), first in first out (paging algorithm), flush when full (paging algorithm), the Marking algorithm (paging algorithm), Balance algorithm (weighted caching algorithm), Greedy Dual (weighted caching algorithm), Landlord (file caching algorithm), Squid (file caching software), k-server problem, primal-dual algorithms, randomized algorithms, online algorithms, competitive analysis, competitive ratio, loose competitiveness, access-graph model, Markov paging.
• Journal of Scheduling 18(6):545-560(2015); (ICALP '13);
The Joint Replenishment Problem (JRP) is a fundamental optimization problem in supply-chain management, concerned with optimizing the flow of goods over time from a supplier to retailers. Over time, in response to demands at the retailers, the supplier sends shipments, via a warehouse, to the retailers. The objective is to schedule shipments to minimize the sum of shipping costs and retailers' waiting costs. We study the approximability of JRP with deadlines, where instead of waiting costs the retailers impose strict deadlines. We study the integrality gap of the standard linear-program (LP) relaxation, giving a lower bound of 1.207, and an upper bound and approximation ratio of 1.574. The best previous upper bound and approximation ratio was 1.667; no lower bound was previously published. For the special case when all demand periods are of equal length we give an upper bound of 1.5, a lower bound of 1.2, and show APX-hardness.
Journal version of .
• This paper gives a lower bound on the complexity of $(1 \pm \epsilon)$-approximately solving a packing or covering problem using a certain class of Lagrangian-relaxation algorithms: any such algorithm, given a problem formed by a random {0,1}-matrix, requires (with high probability) a number of iterations proportional to $(\rho \epsilon)^2\log m$. (Here $\rho$ is a technical parameter, thewidth'' of the problem instance.)

The class of algorithms in question includes Dantzig-Wolfe decomposition, Benders decomposition, the Lagrangian-relaxation method developed by Held and Karp  for lower-bounding TSP, and algorithms recently studied by many authors (including Serge Plotkin, David Shmoys, and Eva Tardos ; Mike Grigoriadis and L.G. Khachiyan ; Michael Luby and Noam Nisan ; Naveen Garg and Jochen Konemann ; and Lisa Fleischer ). The lower bound matches the known upper bounds within a constant factor. The lower bound is useful because, in practice, the dependence on $\epsilon$ limits the applicability of these algorithms. The lower bound provides some insight into what is necessary to surmount this dependence.
Journal version of . Published online Aug. 2015.
• In 1971, Knuth gave an $O(n^2)$-time algorithm for the classic problem of finding an optimal binary search tree. Knuth's algorithm works only for search trees based on 3-way comparisons, while most modern computers support only 2-way comparisons (e.g., $<$, $\le$, $=$, $\ge$, and $>$). Until this paper, the problem of finding an optimal search tree using 2-way comparisons remained open --- poly-time algorithms were known only for restricted variants. We solve the general case, giving (i) an $O(n^4)$-time algorithm and (ii) an $O(n\log n)$-time additive-3 approximation algorithm.

For finding optimal binary split trees, we (iii) obtain a linear speedup and (iv) prove some previous work incorrect.
• A brief summary of greedy algorithms for set cover and vertex cover.

INDEX TERMS: greedy algorithm, set cover, vertex cover.
• Algorithmica 70(4):648-674(2014); FOCS'07
We give an approximation algorithm for packing and covering linear programs (linear programs with non-negative coefficients). Given a constraint matrix with $N$ non-zeros, $r$ rows, and $c$ columns, the algorithm (with high probability) computes feasible primal and dual solutions whose costs are within a factor of $1+\epsilon$ of the optimal cost in time $O((r+c)\log(N)/\epsilon^2 + N)$.
Journal version of .
• Can one choose a good Huffman code without knowing the underlying distribution? The following Online Slot Allocation (OSA) problem models this and similar problems. Given $n$ nonnegative slot costs $c(1), c(2), ..., c(n)$, requests for items are drawn from a hidden probability distribution $p$ on $[n]$, and revealed online one by one. In response to each request, if the item has not been previously requested, the algorithm (knowing only $c$ and the items requested so far) must allocate some unused slot for the item. The objective is to minimize the cost of the slot allocation, which is the sum, over the items, of the probability of the item times the cost of its assigned slot.

The online First-Come-First-Served algorithm (FCFS) simply allocates the cheapest slot available when an item needing a slot arrives. FCFS is optimal among online algorithms. For general costs, we show that its competitive ratio is $1+\ln n$. For concave costs, its competitive ratio is 2. For logarithmic costs, the ratio is, asymptotically, 1: the algorithm's expected cost is $\mbox{opt} + O(\log \mbox{opt})$.

A practical application of OSA is Online Huffman Coding (OHC), a variant of Huffman coding in which the codewords must be allocated on demand, without knowing the underlying probability distribution. OHC reduces to OSA, yielding a first-come-first-served algorithm for OHC that guarantees near-optimal expected cost, at most $\mbox{opt} + 2 \log(1+\mbox{opt}) + 2$.
• Minimum-weight triangulation (MWT) is NP-hard. It has a polynomial-time constant-factor approximation algorithm, and a variety of effective polynomial- time heuristics that, for many instances, can find the exact MWT. Linear programs (LPs) for MWT are well-studied, but previously no connection was known between any LP and any approximation algorithm or heuristic for MWT. Here we show the first such connections: for an LP formulation due to Dantzig et al. (1985): (i) the integrality gap is bounded by a constant; (ii) given any instance, if the aforementioned heuristics find the MWT, then so does the LP. Our analysis improves the best approximation ratio known for MWT (which is an astronomically large constant) by an order of magnitude.
Journal version of .
• Conference version of .
• Algorithmica 66(1):113-152(2013); ICALP'09
This paper describes a simple greedy Δ-approximation algorithm for any covering problem whose objective function is submodular and non-decreasing, and whose feasible region can be expressed as the intersection of arbitrary (closed upwards) covering constraints, each of which constrains at most Δ variables. (A simple example is Vertex Cover, with Δ = 2.) The algorithm generalizes previous approximation algorithms for fundamental covering problems and online paging and caching problems.

(For distributed implementations of these algorithms, see Distributed algorithms for covering, packing and maximum weighted matching .)
Journal version of .
• Theory of Computing 9(22):685-702(2013)
Given a satisfiable 3-SAT formula, how hard is it to find an assignment to the variables that has Hamming distance at most $n/2$ to a satisfying assignment? More generally, consider any polynomial-time verifier for any NP-complete language. A $d(n)$-Hamming-approximation algorithm for the verifier is one that, given any member $x$ of the language, outputs in polynomial time a string $a$ with Hamming distance at most $d(n)$ to some witness $w$, where $(x,w)$ is accepted by the verifier. Previous results have shown that, if P$\ne$NP, then every NP-complete language has a verifier for which there is no $(n/2-n^{2/3+\delta})$-Hamming-approximation algorithm, for various constants $\delta\ge 0$.

Our main result is that, if P$\ne$NP, then every paddable NP-complete language has a verifier that admits no $(n/2+O(\sqrt{n\log n}))$-Hamming-approximation algorithm. That is, one cannot get even half the bits right. We also consider natural verifiers for various well-known NP-complete problems. They do have $n/2$-Hamming-approximation algorithms, but, if P$\ne$NP, have no $(n/2-n^\epsilon)$-Hamming-approximation algorithms for any constant $\epsilon>0$.

We show similar results for randomized algorithms.
First draft circulated in 2003.
• We study the generalization of Huffman Coding in which codeword letters have non-uniform costs (as in Morse code, where the dash is twice as long as the dot). Despite previous work by many authors including Richard Karp  and Kurt Mehlhorn , the problem is not known to be NP-hard, nor was it previously known to have a constant-factor approximation algorithm. The paper describes a polynomial-time approximation scheme (PTAS) for the problem. The algorithm computes a $(1+\epsilon)$-approximate solution in time $O(n + f(\epsilon) \log^3 n)$, where $n$ is the input size.
Journal version of .
• Conference version of .
• This paper gives poly-logarithmic-round, distributed δ-approximation algorithms for covering problems with submodular cost and monotone covering constraints (Submodular-cost Covering). The approximation ratio δ is the maximum number of variables in any constraint. Special cases include Covering Mixed Integer Linear Programs (CMIP), and Weighted Vertex Cover (with δ = 2).

Via duality, the paper also gives poly-logarithmic-round, distributed δ-approximation algorithms for Fractional Packing linear programs (where δ is the maximum number of constraints in which any variable occurs), and for Max Weighted c-Matching in hypergraphs (where δ is the maximum size of any of the hyperedges; for graphs δ = 2).

The paper also gives parallel (RNC) 2-approximation algorithms for CMIP with two variables per constraint and Weighted Vertex Cover.

The algorithms are randomized. All of the approximation ratios exactly match those of comparable centralized algorithms.

(This paper gives distributed implementations of algorithms from Greedy δ-approximation algorithm for covering with arbitrary constraints and submodular cost .)
Journal version of two conference papers, on [covering] and [packing].
• Time series shapelets are small, local patterns in a time series that are highly predictive of a class and are thus very useful features for building classifiers and for certain visualization and summarization tasks. While shapelets were introduced only recently, they have already seen significant adoption and extension in the community.

Despite their immense potential as a data mining primitive, there are two important limitations of shapelets. First, their expressiveness is limited to simple binary presence/absence questions. Second, even though shapelets are computed offline, the time taken to compute them is significant.

In this work, we address the latter problem by introducing a novel algorithm that finds shapelets in less time than current methods by an order of magnitude. Our algorithm is based on intelligent caching and reuse of computations, and the admissible pruning of the search space. Because our algorithm is so fast, it creates an opportunity to consider more expressive shapelet queries. In particular, we show for the first time an augmented shapelet representation that distinguishes the data based on conjunctions or disjunctions of shapelets. We call our novel representation Logical-Shapelets. We demonstrate the efficiency of our approach on the classic benchmark datasets used for these problems, and show several case studies where logical shapelets significantly outperform the original shapelet representation and other time series classification techniques. We demonstrate the utility of our ideas in domains as diverse as gesture recognition, robotics, and biometrics.
• A 40-page sketch of basic techniques in the design and analysis of combinatorial approximation algorithms. Sections: Introduction, Underlying principles, Approximation algorithms with small additive error, Performance guarantees, Randomized rounding and linear programming, Performance ratios and c-approximation, Polynomial approximation schemes, Constant-factor performance guarantees, Logarithmic performance guarantees, Multi-criteria problems, Hard-to-approximate problems, Research Issues and Summary, Defining Terms, Further Information.
• The paper presents distributed and parallel δ-approximation algorithms for covering problems, where δ is the maximum number of variables on which any constraint depends (for example, δ=2 for vertex cover).

Specific results include the following.
• For weighted vertex cover, the first distributed 2-approximation algorithm taking O(log n) rounds and the first parallel 2-approximation algorithm in RNC. The algorithms generalize to covering mixed integer linear programs (CMIP) with two variables per constraint (δ=2).
• For any covering problem with monotone constraints and submodular cost, a distributed $δ$-approximation algorithm taking O(log2 |C|) rounds, where |C| is the number of constraints. (Special cases include CMIP, facility location, and probabilistic (two-stage) variants of these problems.)

(This paper gives distributed implementations of algorithms from Greedy δ-approximation algorithm for covering with arbitrary constraints and submodular cost .)
• We present efficient distributed δ-approximation algorithms for fractional packing and maximum weighted b-matching in hypergraphs, where δ is the maximum number of packing constraints in which a variable appears (for maximum weighted b-matching δ is the maximum edge degree - for graphs δ = 2). (a) For δ = 2 the algorithm runs in O(log m) rounds in expectation and with high probability. (b) For general δ, the algorithm runs in O(log2 m) rounds in expectation and with high probability.

(This paper gives distributed algorithms for the duals of problems considered in Greedy δ-approximation algorithm for covering with arbitrary constraints and submodular cost .)
• Conference version of .
• With fully directional communications, nodes must track and periodically update the positions of their discovered neighbors, so that communication with these neighbors is feasible when needed. If tracking fails, neighbors that move out of the directional footprint will need to be rediscovered. The tracking process introduces an overhead, which increases with the number of discovered neighbors. The overhead can be reduced if each node maintains only a subset of its neighbors; however, this may increase the hop count of paths between nodes, i.e., causes a hop stretch. In this work, we study the tradeoffs between node degree and hop stretch. We first design a topology control algorithm to optimize this tradeoff. For the purposes of this design, we assume that nodes perform circular directional transmissions to communicate with the nodes in their directional range; in this way, the network can be modeled as a unit disk graph (UDG). Given a UDG G, our algorithm finds a sparse subgraph G' with a maximum node degree of 6, and connecting each node pair u, v by a path of length hopsG'(u,v) = O(hopsG(u,v) + log Δ), where Δ is the maximum degree in G, and hopsG(u,v) denotes the length of the shortest path between u and v in graph G. We show that this result is near optimal. Based on the insights gained from the above design, we next construct a more practical scheme, which integrates topology control with fully directional neighbor discovery and maintenance. The simulated performance of our practical scheme is only slightly worse than the theoretical optimal performance in terms of node degree and path stretch.
Journal version of .
• A brief summary of greedy set-cover algorithms.

INDEX TERMS: dominating set, greedy algorithm, hitting set, set cover, minimizing a linear function subject to a submodular constraint.
• Following Mettu and Plaxton, we study incremental algorithms for the $k$-medians problem. Such an algorithm must produce a nested sequence $F_1 \subseteq F_2 \subseteq \cdots \subseteq F_n$ of sets of facilities. Mettu and Plaxton show that incremental metric medians has a (roughly) $40$-competitive deterministic polynomial-time algorithm. We give improved algorithms, including a $(24+ε)$-competitive deterministic polynomial-time algorithm and a $5.44$-competitive, randomized, non-polynomial-time algorithm.

We also consider the competitive ratio with respect to size. An algorithm is $s$-size-competitive if, for each $k$, the cost of $F_k$ is at most the minimum cost of any set of $k$ facilities, while the size of $F_k$ is at most $s k$. We present optimally competitive algorithms for this problem.

Our proofs reduce incremental medians to the following online bidding problem: faced with some unknown threshold $T>0$, an algorithm must submit bids'' $b>0$ until it submits a bid as large as $T$. The algorithm pays the sum of its bids. We describe optimally competitive algorithms for online bidding.

Our results on cost-competitive incremental medians extend to approximately metric distance functions, incremental fractional medians, and incremental bicriteria approximation.
Journal version of .
• A brief summary of online paging and caching results, 1985-2002.

INDEX TERMS: paging, caching, weighted caching, weighted paging, file caching, least recently used (paging algorithm), first in first out (paging algorithm), flush when full (paging algorithm), the Marking algorithm (paging algorithm), Balance algorithm (weighted caching algorithm), Greedy Dual (weighted caching algorithm), Landlord (file caching algorithm), Squid (file caching software), k-server problem, primal-dual algorithms, randomized algorithms, online algorithms, competitive analysis, competitive ratio, loose competitiveness, access-graph model, Markov paging.
• We study the problem of selecting control clones in DNA array hybridization experiments. The problem arises in the OFRG method for analyzing microbial communities. The OFRG method performs classification of rRNA gene clones using binary fingerprints created from a series of hybridization experiments, where each experiment consists of hybridizing a collection of arrayed clones with a single oligonucleotide probe. This experiment produces analog signals, one for each clone, which then need to be classified, that is, converted into binary values 1 and 0 that represent hybridization and non-hybridization events. In addition to the sample rRNA gene clones, the array contains a number of control clones needed to calibrate the classification procedure of the hybridization signals. These control clones must be selected with care to optimize the classification process. We formulate this as a combinatorial optimization problem called Balanced Covering. We prove that the problem is NP-hard, and we show some results on hardness of approximation. We propose approximation algorithms based on randomized rounding, and we show that, with high probability, our algorithms approximate well the optimum solution. The experimental results confirm that the algorithms find high quality control clones. The algorithms have been implemented and are publicly available as part of the software package called CloneTools.
Journal version of .
• Conference version of .
• Conference version of .
• Given a rooted tree whose leaves have numeric labels, label the nodes so that each leaf label equals the sum of the node labels from the root to the leaf. Minimize the number of non-zero node labels. This problem models finding a parsimonius explanation of observed changes in hierarchical data. We give fast algorithms with applications.
• An introduction to greedy approximation algorithms for combinatorial optimization problems. Informal definition of greedy algorithms. Greedy Set Cover, and other generalizations and applications. Greedy algorithms for other problems. Lagrangean relaxation based methods. Local ratio, greedy dual based schemes and greedy rounding. Connected Dominating Sets and other applications. Priority Based Approaches.
• Poster in Proceedings of the IEEE International Conference on Database Engineering (ICDE),
Poster presentation of .
• This paper is a follow-up to Topology control to simultaneously achieve near-optimal node degree and low path stretch in ad hoc networks. It gives a related heuristic with more extensive empirical evaluation.
• Conference version of .
• The Reverse Greedy algorithm (RGreedy) for the $k$-median problem works as follows. It starts by placing facilities on all nodes. At each step, it removes a facility to minimize the resulting total distance from the customers to the remaining facilities. It stops when $k$ facilities remain. We prove that, if the distance function is metric, then the approximation ratio of RGreedy is between $Ω(\log n/ \log \log n)$ and $O(\log n)$.
Journal version of .
• Conference version of .
• Given matrices A and B and vectors a, b, c and d, all with non-negative entries, we consider the problem of minimizing c$\cdot$x subject to x$\in Z^n_+$, Ax ≥ a, Bx ≤ b, and x ≤ d. We give a bicriteria-approximation algorithm that, given ε$\in$(0, 1], finds a solution of cost O(log(m)/ε2) times optimal, meeting the covering constraints Ax≥a and multiplicity constraints (x≤d), and satisfying Bx≤(1 + ε)b + β, where β is the vector of row sums, β_i=$\sum_j$ Bij. Here m is the number of rows of A.

This gives an O(log m)-approximation algorithm for CIP --- minimum-cost covering integer programs with multiplicity constraints, i.e., the special case when there are no packing constraints Bx ≤ b. The previous best approximation ratio has been O(log maxj $\sum_i$Aij since 1982. CIP contains the set cover problem as a special case, so O(log m)-approximation is the best possible unless P=NP.
• Conference version of .
• The multiway-cut problem is, given a weighted graph and $k > 2$ terminal nodes, to find a minimum-weight set of edges whose removal separates all the terminals. The problem is NP-hard, and even NP-hard to approximate within $1+δ$ for some small $δ > 0$.

This paper gives a 12/11-approximation algorithm for $k=3$ and a 1.3438-approximation algorithm in general. Working on this result was particularly fun because the problem of designing the algorithm was itself formulated as an optimization problem and approximately solved by computer. These computations guided the final (non-computational) solutions.

The paper also gives a proof that for a general class of geometric embedding'' relaxations, there are always randomized rounding schemes that match the integrality gap. (Another example of such a geometric-embedding relaxation is the semidefinite-programming relaxation of max-cut by Goemans and Williamson .)

The geometric relaxation is due to Calinescu, Karloff, and Rabani . After our paper, the approximation ratio for general $k$ was subsequently improved to 1.3238 in .

Journal version of .
• The Astronomical Journal 125:2276-2286(2003)
This paper describes a modification to the heuristic algorithm presented in Data collection for the Sloan Digital Sky Survey: a network-flow heuristic.'' The modification takes into account that galaxies that are too close together cannot be captured in the same snapshot. This was not considered in the original algorithm.
• Conference version of .
• Algorithmica 33(3):371-383(2002); SODA'98
This paper introduced a new file-caching algorithm, called Landlord or Greedy-Dual-size, for use by web proxies and browsers. The paper gave theoretical analyses of the algorithm suggesting that it would perform well in practice.

A modified version of the algorithm is incorporated into the public-domain Squid web proxy.

The algorithm was independently obtained by Pei Cao and Sandy Irani . Both they and John Dilley et al  found that the algorithm worked well empirically.

The results in this paper strengthen and generalize those in  and are further generalized in .
Journal version of .
• Congestion control in the current Internet is accomplished mainly by TCP/IP. To understand the macroscopic network behavior that results from TCP/IP and similar end-to-end protocols, one main analytic technique is to show that the the protocol maximizes some global objective function of the network traffic.

Here we analyze a particular end-to-end, MIMD (multiplicative-increase, multiplicative-decrease) protocol. We show that if all users of the network use the protocol, and all connections last for at least logarithmically many rounds, then the total weighted throughput (value of all packets received) is near the maximum possible. Our analysis includes round-trip-times, and (in contrast to most previous analyses) gives explicit convergence rates, allows connections to start and stop, and allows capacities to change.

An application explained in the paper is multi-path bandwidth measurement.
• Mixed packing and covering problems are problems that can be formulated as linear programs using only non-negative coefficients. Examples include multicommodity network flow, the Held-Karp lower bound on TSP, fractional relaxations of set cover, bin-packing, knapsack, scheduling problems, minimum-weight triangulation, etc. This paper gives approximation algorithms for the general class of problems. The sequential algorithm can be implemented to find an $(1\pmε)$-approximate solution in $O(ε^{-2}\log m)$ linear-time iterations.

For $ε = O(1)$, these algorithms are currently the fastest known for the general problem. The results generalize previous work on pure packing and covering (the special case when the constraints are all less-than'' or all greater-than'') by Michael Luby and Noam Nisan ; and Naveen Garg and Jochen Konemann ; and Lisa Fleischer 
• The paper gives approximation algorithms for the $k$-medians and facility-location problems (both NP-hard). For $k$-medians, the algorithm returns a solution using at most $\ln(n+n/ε) k$ medians and having cost at most $(1+ε)$ times the cost of the best solution that uses at most $k$ medians. Here $ε>0$ is an input to the algorithm. In comparison, the best previous algorithm [Lin and Jeff Vitter, 1992] had a $(1+1/ε)\ln n$ term instead of the $\ln(n+n/ε)$ term in the performance guarantee. For facility location, the algorithm returns a solution of cost at most $d+\ln(n) k$, provided there exists a solution of cost $d+k$ where $d$ is the assignment cost and $k$ is the facility cost. In comparison, the best previous algorithm [Dorit Hochbaum, 1982] returned a solution of cost at most $\ln(n)(d+k)$. For both problems, the algorithms currently provide the best performance guarantee known for the general (non-metric) problems.

The paper also introduces a new probabilistic bound (the so-called Chernoff-Wald bound) for bounding the expectation of the maximum of a collection of sums of random variables, when each sum contains a random number of terms. The bound is used to analyze the randomized rounding scheme that underlies the algorithms.
• Journal of Algorithms 37(1):218-235(2000); SODA'98
Elias Koutsoupias and Christos Papadimitriou introduced the so-called diffuse-adversary model for analyzing paging strategies such as least-recently-used (LRU) . Briefly, the diffuse adversary model is as follows. Fix some $ε >0$ Run the algorithm, and, for each request, allow the adversary to assign a probability distribution to the items such that no item gets probability more than $ε$, then request an item randomly from the distribution. (When $ε$ is near 1, one has the standard competitive-analysis model; when $ε$ is near 0, the adversary is severely handicapped.) The algorithm is $c$-competitive if the expected cost of the algorithm on the adversarial sequence is at most $c$ times the expected cost of the optimal algorithm.

Koutsoupias and Papadimitriou proved that LRU was optimally competitive (among deterministic on-line algorithms) according to the model, but left open the question of what the performance guarantee for LRU actually was. This paper answers the question, giving an explicit formula for the performance guarantee within a factor of two. The paper also shows that the performance guarantee of first-in-first-out (FIFO) is much worse than that of LRU, and analyzes the performance guarantees of randomized on-line paging strategies.
Journal version of Bounding the diffuse adversary.
• The data broadcast problem is to find a schedule for broadcasting a given set of messages over multiple channels. The goal is to minimize the cost of the broadcast plus the expected response time to clients who periodically and probabilistically tune in to wait for particular messages.

The problem models disseminating data to clients in asymmetric communication environments, where there is a much larger capacity from the information source to the clients than in the reverse direction. Examples include satellites, cable TV, Internet broadcast, and mobile phones. Such environments favor the push-based'' model where the server broadcasts (pushes) its information on the communication medium and multiple clients simultaneously retrieve the specific information of individual interest.

This paper presents the first polynomial-time approximation scheme (PTAS) for data broadcast with $O(1)$ channels and when each message has arbitrary probability, unit length and bounded cost. The best previous polynomial-time approximation algorithm for this case has a performance ratio of 9/8 [Bar-Noy et al, 2000].
• A 40-page sketch of basic techniques in the design and analysis of combinatorial approximation algorithms. Sections: Introduction, Underlying principles, Approximation algorithms with small additive error, Performance guarantees, Randomized rounding and linear programming, Performance ratios and c-approximation, Polynomial approximation schemes, Constant-factor performance guarantees, Logarithmic performance guarantees, Multi-criteria problems, Hard-to-approximate problems, Research Issues and Summary, Defining Terms, Further Information.
• Two common objectives for evaluating a schedule are the makespan, or schedule length, and the average completion time. This short note gives improved bounds on the existence of schedules that simultaneously optimize both criteria. In particular, for any ρ > 0, there exists a schedule of makespan at most 1+ρ times the minimum, with average completion time at most 1/(1-e) times the minimum. The proof uses an infinite-dimensional linear program to generalize and strengthen a previous analysis by Cliff Stein and Joel Wein .
• Conference version of .
• Conference version of .
• The goal of the Sloan Digital Sky Survey is to map in detail one-quarter of the entire sky, determining the positions and absolute brightnesses of more than 100 million celestial objects'' . The survey will be performed by taking snapshots'' through a large telescope. Each snapshot can capture up to 600 objects from a small circle of the sky (as illustrated here). This paper describes the design and implementation of the algorithm that is being used to determine the snapshots so as to minimize their number. The problem is NP-hard in general; the algorithm described is a heuristic, based on Lagrangian-relaxation and min-cost network flow. It gets within 5-15% of a naive lower bound, whereas using a uniform'' cover only gets within 25-35%.
Journal version of .
• Conference version of .
• Pattern-matching-based document-compression systems (e.g. for faxing) rely on finding a small set of patterns that can be used to represent all of the ink in the document. Finding an optimal set of patterns is NP-hard; previous compression schemes have resorted to heuristics. This paper describes an extension of the cross-entropy approach, used previously for measuring pattern similarity, to this problem. This approach reduces the problem to a k-medians problem, for which the paper gives a new algorithm with a provably good performance guarantee. In comparison to previous heuristics (First Fit, with and without generalized Lloyd's/k-means post-processing steps), the new algorithm generates a better codebook, resulting in an overall improvement in compression performance of almost 17%.
• Journal of Algorithms 24(2):310-324(1997)
The problem considered is the following. Given a graph with edge weights satisfying the triangle inequality, and a degree bound for each vertex, compute a low-weight spanning tree such that the degree of each vertex is at most its specified bound. The problem is NP-hard (it generalizes Traveling Salesman (TSP)). The paper describes a network-flow heuristic for modifying a given tree T to meet the constraints. Choosing T to be a minimum spanning tree (MST) yields approximation algorithms with performance guarantee less than 2 for the problem on geometric graphs with Lp-norms. The paper also describes a Euclidean graph whose minimum TSP costs twice the MST, disproving a conjecture made in [Low-degree spanning trees of small weight].
• The paper focuses on two problems: (i) how to orient the edges of an undirected graph in order to maximize the number of ordered vertex pairs $(x,y)$ such that there is a directed path from $x$ to $y$, and (ii) how to orient the edges so as to minimize the number of such pairs. The paper describes a quadratic-time algorithm for the first problem, and a proof that the second problem is NP-hard to approximate within some constant $1+ε > 1$. The latter proof also shows that the second problem is equivalent to comparability graph completion''; neither problem was previously known to be NP-hard.
• J. Combinatorial Theory, Series A 76(1):44-54(1996)
N.J.A. Sloane maintains an on-line encyclopedia of interesting integer sequences. This paper presents an operation for modifying such a sequence, and analyzes the operation using generating functions. If you are interested, search for the word boustrophedon'' at  or go here.
• Conference version of .
• The degree-d spanning tree problem asks for a minimum-weight spanning tree in which the degree of each vertex is at most $d$. When $d=2$ the problem is TSP, and in this case, the well-known Christofides algorithm provides a 1.5-approximation algorithm (assuming the edge weights satisfy the triangle inequality).

Christos Papadimitriou and Umesh Vazirani posed the challenge of finding an algorithm with performance guarantee less than 2 for Euclidean graphs (points in $R^n$) and $d > 2$ . This paper gives the first answer to that challenge, presenting an algorithm to compute a degree-3 spanning tree of cost at most 5/3 times the MST. For points in the plane, the ratio improves to 3/2 and the algorithm can also find a degree-4 spanning tree of cost at most 5/4 times the MST.
Journal version of .
• The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NP-hard; This paper gives a proof that, for graphs where each directed cycle has at most three edges, the MEG problem is equivalent to maximum bipartite matching, and therefore solvable in polynomial time. This leads to an improvement in the performance guarantee of the previously best approximation algorithm for the general problem, from Approximating the minimum equivalent digraph'' .

(This result was improved by Berman et al, WADS, 2009.)
• An experimental implementation and evaluation of data structures from Approximate data structures with applications'' . To summarize: Our results suggest that the data structure is practical and can be faster than traditional priority queues when holding a large number of keys, and that tolerance for approximate answers can lead to significant increases in speed.''
• Describes a near-linear-time algorithm for a variant of Huffman coding, in which the letters may have non-uniform lengths (as in Morse code), but with the restriction that each word to be encoded has equal probability.

Journal version of .
• The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NP-hard. This paper gives an approximation algorithm with performance guarantee of pi2/6 ~ 1.64. The algorithm and its analysis are based on the simple idea of contracting long cycles.

(This result was strengthened slightly in On strongly connected digraphs with bounded cycle length'' .) The analysis applies directly to 2-Exchange, a simple local improvement algorithm, showing that its performance guarantee is 1.75. It was further improved by Berman et al, WADS 2009.)
Journal version of .
• Algorithmica 14(4):305-322(1995); SODA'93
This paper has a simple linear-time algorithm to find a spanning tree that simultaneously approximates a shortest-path tree and a minimum spanning tree. The algorithm provides a continuous trade-off: given the two trees and $ε > 0$, the algorithm returns a spanning tree in which the distance between any vertex and the root of the shortest-path tree is at most $1+ε$ times the shortest-path distance, and yet the total weight of the tree is at most $1+2/ε$ times the weight of a minimum spanning tree. This is the best trade-off possible.
Journal version of .
• Randomized rounding is a standard method, based on the probabilistic method, for designing combinatorial approximation algorithms. This paper introduces a variant of randomized rounding that can be used to derive Lagrangian-relaxation algorithms and greedy algorithms. This was surprising - the approximation algorithms derived previously by the method require first solving a linear program (or some other optimization problem). Indeed, in Prabhakar Raghavan's seminal paper introducing randomized rounding , he says: "The time taken to solve the linear program relaxations of the integer programs dominates the net running time theoretically (and, most likely, in practice as well)."

The variant of randomized rounding described here avoids this bottleneck, at least for packing/covering-type problems. The resulting algorithms are Lagrangian-relaxation algorithms and greedy algorithms (in fact the greedy set-cover algorithm is an example), and are faster and simpler to implement than standard randomized-rounding algorithms. This approach gives a systematic, coherent, and comprehensive understanding of Lagrangian-relaxation algorithms for packing and covering linear programs.

The ideas introduced here are also used in Sequential and parallel algorithms for mixed packing and covering'' , in K-medians, facility location, and the Chernoff-Wald bound'' ), and in On the number of iterations for Dantzig-Wolfe optimization and packing-covering approximation algorithms'' .
Lecture notes on this topic are here.
• Cornell University School of ORIE Technical Report 1103(1994)
We consider the problem of choosing Euclidean points to maximize the sum of their weighted pairwise distances, when each point is constrained to a ball centered at the origin. We derive a dual minimization problem and show strong duality holds (i.e., the resulting upper bound is tight) when some locally optimal configuration of points is affinely independent. We sketch a polynomial time algorithm for finding a near-optimal set of points.
• The paper describes a simple deterministic parallel/distributed $(2+ε)$-approximation algorithm for the minimum-weight vertex-cover problem and its dual (edge/element packing). This paper was one of the first to use the primal-dual method for approximation in the distributed setting. This result was strengthened in  Distributed and parallel algorithms for weighted vertex cover and other covering problems.''
Journal version of .
• This paper explores the notion of approximate data structures, which return approximately correct answers to queries, but run faster than their exact counterparts. The paper describes approximate variants of the van Emde Boas data structure, which support the same dynamic operations as the standard van Emde Boas data structure (min, max, successor, predecessor, and existence queries, as well as insertion and deletion), except that answers to queries are approximate. The variants support all operations in constant time provided the performance guarantee is 1+1/polylog n, and in O(log log n) time provided the performance guarantee is 1+1/poly(n), for n elements in the data structure.

Applications described include Prim's minimum-spanning-tree algorithm, Dijkstra's single-source shortest paths algorithm, and an on-line variant of Graham's convex hull algorithm. To obtain output which approximates the desired output with the performance guarantee tending to 1, Prim's algorithm requires only linear time, Dijkstra's algorithm requires O(m log log n) time, and the on-line variant of Graham's algorithm requires constant amortized time per operation.
• Conference version of .
• The traditional multi-commodity flow problem assumes a given flow network in which multiple commodities are to be maximally routed in response to given demands. This paper considers the multi-commodity flow network-design problem: given a set of multi-commodity flow demands, find a network subject to certain constraints such that the commodities can be maximally routed.

This paper focuses on the case when the network is required to be a tree. The main result is an approximation algorithm for the case when the tree is required to be of constant degree. The algorithm reduces the problem to the minimum-weight balanced-separator problem; the performance guarantee of the algorithm is within a factor of 4 of the performance guarantee of the balanced-separator procedure. If Tom Leighton and Satish Rao's balanced-separator procedure  is used, the performance guarantee is $O(\log n)$. This improves the $O(\log^2 n)$ approximation factor that is trivial to obtain by a direct application of the balanced-separator method.
Journal version of .
• Conference version of .
• Conference version of .
• Von Neumann's Min-Max Theorem guarantees that each player of a zero-sum matrix game has an optimal mixed strategy. This paper gives an elementary proof that each player has a near-optimal mixed strategy that chooses uniformly at random from a multi-set of pure strategies of size logarithmic in the number of pure strategies available to the opponent.

For exponentially large games, for which even representing an optimal mixed strategy can require exponential space, it follows that there are near-optimal, linear-size strategies. These strategies are easy to play and serve as small witnesses to the approximate value of the game.

As a corollary, it follows that every language has small hard'' multi-sets of inputs certifying that small circuits can't decide the language. For example, if SAT does not have polynomial-size circuits, then, for each $n$ and $c$, there is a set of $n^{O(c)}$ Boolean formulae of size $n$ such that no circuit of size $n^c$ (or algorithm running in time $n^c$) classifies more than two-thirds of the formulae successfully.

The "simple strategies" observation was generalized to Nash equilibrium by Lipton et al. in their paper Playing large games using simple strategies.
• Algorithmica 11(6):525-541(1994); SODA'91
This paper has two results. The first is based on the surprising observation that the well-known least-recently-used'' paging algorithm and the balance'' algorithm for weighted caching are linear-programming primal-dual algorithms. This observation leads to a strategy (called Greedy-Dual'') that generalizes them both and has an optimal performance guarantee for weighted caching.

For the second set of results, the paper presents empirical studies of paging algorithms, documenting that in practice, on typical'' cache sizes and sequences, the performance of paging strategies are much better than their worst-case analyses in the standard model suggest. The paper then presents theoretical results that support and explain this. For example: on any input sequence, with almost all cache sizes, either the performance guarantee of least-recently-used is $O(\log k)$ or the fault rate (in an absolute sense) is insignificant.

These results are strengthened and/or generalized in  and [ 2009].
Journal version of .
• Conference version of .
• Conference version of .
• Conference version of .
• Journal of Algorithms 12(4):685-699(1991)
The paging problem is that of deciding which pages to keep in a memory of $k$ pages in order to minimize the number of page faults. This paper introduced the marking algorithm , a randomized on-line algorithm for the paging problem, and a proof that its performance guarantee is $O(\log k)$. This was one of the first results in the area showing that a randomized algorithm can have a strictly better performance guarantee than any deterministic on-line algorithm (no deterministic on-line algorithm can have a performance guarantee better than k).
• Technical Report, Computer Science Department, Princeton University CS-TR-348-91(1991)
The thesis was about on-line algorithms. The main results were published in papers described above. The unpublished results were mainly exploring the role of linear-programming duality in on-line algorithms, including a method for deriving a potential-function analysis from a linear-programming duality analysis, and studies of some on-line matching algorithms. There were also some minor results analyzing paging strategies with lookahead.
• Networks 21(2):205-221(1991)
The parametric shortest path problem is to find the shortest paths in graph where the edge costs are of the form $w_{ij}+\lambda$, where each $w_{ij}$ is constant and $\lambda$ is a parameter that varies. The problem is to find shortest path trees for every possible value of $\lambda$.

The minimum-balance problem is to find a weighting'' of the vertices so that adjusting the edge costs by the vertex weights yields a graph in which, for every cut, the minimum weight of any edge crossing the cut in one direction equals the minimum weight of any edge crossing the cut in the other direction.

The paper presents the fastest known algorithms for both problems. The algorithms run in $O(nm+n^2\log n)$ time. The paper also describes empirical studies of the algorithms on random graphs, suggesting that the expected time for finding minimum-mean cycles (an important special case of both problems) is $O(m + n \log n)$.
• Technical Report, Computer Science Department, Princeton University CS-TR-317-91(1991)
This technical report presented notes from the first eight lectures of the class Many Models of Complexity taught by László Lovász at Princeton University in the fall of 1990. The topic is evasiveness of graph properties: given a graph property, how many edges of the graph an algorithm must check in the worst case before it knows whether the property holds.