publications - Neal E. Young

neal young / publications

working paper(2025)
We give a polynomial-time approximation algorithm for the (not necessarily metric) $k$-Median problem. The algorithm is an $\alpha$-size-approximation algorithm for $\alpha < 1+2\ln(n/k)$. That is, it guarantees a solution having size at most $\alpha\times k$, and cost at most the cost of any size-$k$ solution. This is the first polynomial-time approximation algorithm to match the well-known bounds of $H_{\Delta}$ and $1+\ln(n/\textsf{opt})$ for unweighted Set Cover (a special case) within a constant factor. It matches these bounds within a factor of 2. The algorithm runs in time $O(k\,m\log(n/k)\log m)$, where $n$ is the number of customers and $m$ is the instance size.
- more...
- full text
- bibtex
- google scholar
- link
Algorithmica 87:43(2025)
With Marek Chrobak, Samuel Haney, Mehraneh Liaee, Debmalya Panigrahi, Rajmohan Rajaraman and Ravi Sundaram
It is natural to generalize the online $k$-Server problem by allowing each request to specify not only a point $p$, but also a subset $S$ of servers that may serve it. To initiate a systematic study of this generalization, we focus on uniform and star metrics. For uniform metrics, the problem is equivalent to a generalization of Paging in which each request specifies not only a page $p$, but also a subset $S$ of cache slots, and is satisfied by having a copy of $p$ in some slot in $S$. We call this problem Slot-Heterogenous Paging.

In realistic settings only certain subsets of cache slots or servers would appear in requests. Therefore we parameterize the problem by specifying a family $\mathcal S \subseteq 2^{[k]}$ of requestable slot sets, and we establish bounds on the competitive ratio as a function of the cache size $k$ and family $\mathcal S$:
- If all request sets are allowed ($\mathcal S=2^{[k]}\setminus\{\emptyset\}$), the optimal deterministic and randomized competitive ratios are exponentially worse than for standard Paging ($\mathcal S=\{[k]\}$).
- As a function of $|\mathcal S|$ and $k$, the optimal deterministic ratio is polynomial: at most $O(k^2|\mathcal S|)$ and at least $\Omega(\sqrt{|\mathcal S|})$.
- For any laminar family $\mathcal S$ of height $h$, the optimal ratios are $O(hk)$ (deterministic) and $O(h^2\log k)$ (randomized).
- The special case of laminar $\mathcal S$ that we call All-or-One Paging extends standard Paging by allowing each request to specify a specific slot to put the requested page in. The optimal deterministic ratio for weighted All-or-One Paging is $\Theta(k)$. Offline All-or-One Paging is NP-hard.
Some results for the laminar case are shown via a reduction to the generalization of Paging in which each request specifies a set $\mathcal P$ of pages, and is satisfied by fetching any page from $\mathcal P$ into the cache. The optimal ratios for the latter problem (with laminar family of height $h$) are at most $hk$ (deterministic) and $hH_k$ (randomized).
Journal version of [2023].
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM Transactions on Algorithms 21(2):1--19(2024)
With Marek Chrobak
Given a weighted, ordered query set $Q$ and a partition of $Q$ into classes, we study the problem of computing a minimum-cost decision tree that, given any query $q$ in $Q$, uses equality tests and less-than comparisons to determine the class to which $q$ belongs. Such a tree can be much smaller than a lookup table, and much faster and smaller than a conventional search tree. We give the first polynomial-time algorithm for the problem. The algorithm extends naturally to the setting where each query has multiple allowed classes.
Journal version of [2023].
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM Transactions on Algorithms 20(4):1-28(2024)
With Claire Mathieu, Rajmohan Rajaraman and Arman Yousefi^*
Data-structure dynamization is a general approach for making static data structures dynamic. It is used extensively in geometric settings and in the guise of so-called merge (or compaction) policies in big-data databases such as LevelDB and Google Bigtable. Previous theoretical work is based on worst-case analyses for uniform inputs - insertions of one item at a time and non-varying read rate. In practice, merge policies must not only handle batch insertions and varying read/write ratios, they can take advantage of such non-uniformity to reduce cost on a per-input basis.

To model this, we initiate the study of data-structure dynamization through the lens of competitive analysis, via two new online set-cover problems. For each, the input is a sequence of disjoint sets of weighted items. The sets are revealed one at a time. The algorithm must respond to each with a set cover that covers all items revealed so far. It obtains the cover incrementally from the previous cover by adding one or more sets and optionally removing existing sets. For each new set the algorithm incurs build cost equal to the weight of the items in the set. In the first problem the objective is to minimize total build cost plus total query cost, where the algorithm incurs a query cost at each time t equal to the current cover size. In the second problem, the objective is to minimize the build cost while keeping the query cost from exceeding $k$ (a given parameter) at any time. We give deterministic online algorithms for both variants, with competitive ratios of $\Theta(\log^* n)$ and $k$, respectively. The latter ratio is optimal for the second variant.
Journal version of [2021].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
Workshop on Algorithms and Data Structure (WADS)
With Marek Chrobak
Conference version of [2024].
Published in Lecture Notes in Computer Science (LNCS Volume 14079, July 2023).
- more...
- full text
- bibtex
- google scholar
- doi
- link
International Symposium on Theoretical Aspects of Computer Science (STACS)
With Marek Chrobak, Samuel Haney, Mehraneh Liaee, Debmalya Panigrahi, Rajmohan Rajaraman and Ravi Sundaram
Conference version of [2024].
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM Transactions on Algorithms 18(1):1-11(2022)
With Marek Chrobak, Mordecai Golin and J. Ian Munro
In 1971, Knuth gave an O(n²)-time algorithm for the classic problem of finding an optimal binary search tree. Knuth's algorithm works only for search trees based on 3-way comparisons, while most modern computers support only 2-way comparisons (e.g., <, <=, =, >=, and >). Until this paper, the problem of finding an optimal search tree using 2-way comparisons remained open --- poly-time algorithms were known only for restricted variants. We give an O(n⁴)-time algorithm for the general case. The algorithm and analysis are simpler than those for the previously solved special case.
See also [Chrobak21Cost, Chrobak22Huang, Chrobak15Optimal.]
- more...
- full text
- bibtex
- google scholar
- doi
- link
Acta Informatica 59:687-708(2022)
With Marek Chrobak, Mordecai Golin and J. Ian Munro
Huang and Wong [5] proposed a polynomial-time dynamic-programming algorithm for computing optimal generalized binary split trees. We show that their algorithm is incorrect. Thus, it remains open whether such trees can be computed in polynomial time. Spuler [11, 12] proposed modifying Huang and Wong's algorithm to obtain an algorithm for a different problem: computing optimal two-way-comparison search trees. We show that the dynamic program underlying Spuler's algorithm is not valid, in that it does not satisfy the necessary optimal-substructure property and its proposed recurrence relation is incorrect. It remains unknown whether the algorithm is guaranteed to compute a correct overall solution.
See also [Chrobak21Cost, Chrobak22Simple, Chrobak15Optimal.]
- more...
- full text
- bibtex
- google scholar
- doi
- link
The VLDB Journal 30:361-378(2021)
With Qizhong Mao, Steven Jacobs, Waleed Amjad, Vagelis Hristidis and Vassilis Tsotras
Modern NoSQL database systems use log-structured merge (LSM) storage architectures to support high write throughput. LSM architectures aggregate writes in a mutable MemTable (stored in memory), which is regularly flushed to disk, creating a new immutable file called an SSTable. Some of the SSTables are chosen to be periodically merged -- replaced with a single SSTable containing their union. A merge policy (a.k.a. compaction policy) specifies when to do merges and which SSTables to combine. A bounded depth merge policy is one that guarantees that the number of SSTables never exceeds a given parameter k, typically in the range 3--10. Bounded depth policies are useful in applications where low read latency is crucial, but they and their underlying combinatorics are not yet well understood. This paper compares several bounded depth policies, including representative policies from industrial NoSQL databases and two new ones based on recent theoretical modeling, as well as the standard Tiered policy and Leveled policy. The results validate the proposed theoretical model and show that, compared to the existing policies, the newly proposed policies can have substantially lower write amplification with comparable read amplification.
Journal version of Mao19Experimental.
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM-SIAM Symposium on Discrete Algorithms
With Claire Mathieu, Rajmohan Rajaraman and Arman Yousefi^*
Conference version of [2024].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
Information and Computation 218:11 pages(2021)
With Marek Chrobak, Mordecai Golin and J. Ian Munro
Optimal search trees can be used to implement access operations to a set of keys whose access probabilities are known. For a non-key query, such trees can determine its approximate location by returning the inter-key interval containing the query. This is in contrast to other dictionary data structures, like lists or hash tables, that only report a failed search. We address the question `what is the additional cost of determining approximate locations for non-key queries'? We prove that for two-way comparison trees this additional cost is at most 1. Our proof is technically interesting, as it is based on a novel probabilistic argument that involves converting a search tree that does not identify non-key queries into a random tree that does.
See also [Chrobak22Simple, Chrobak22Huang, Chrobak15Optimal]
- more...
- full text
- bibtex
- google scholar
- doi
- link
IEEE International Conference on Big Data
With Qizhong Mao, Steven Jacobs, Waleed Amjad, Vagelis Hristidis and Vassilis Tsotras
Conference version of Mao21Comparison.
- more...
- full text
- bibtex
- google scholar
- doi
- link
International Conference on Web Information Systems Engineering
With Nhat X.T. Le and Vagelis Hristidis
The paper considers the problem of summarising a collection of reviews for an item; the application considered is that of reviews for doctors. We consider the case where there are aspects about the item that are captured in the review, and are organized in an ontology in the form of a DAG. There is also sentiment about the aspects, so the reviews can be thought of as aspect/sentiment pairs. The goal is to find a subset of them to cover all the pairs, where coverage takes into account the ontology structure. We propose and analyse algorithms, and perform quantitative and qualitative experiments on real data.
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM International Conference on Advances in Geographic Information Systems
With Vincent Cohen-Addad and Philip Klein
We consider the problem of political redistricting: given the locations of people in a geographical area (e.g. a US state), the goal is to decompose the area into subareas, called districts, so that the populations of the districts are as close as possible and the districts are "compact" and "contiguous," to use the terms referred to in most US state constitutions and/or US Supreme Court rulings. We study a method that outputs a solution in which each district is the intersection of a convex polygon with the geographical area. The average number of sides per polygon is less than six. The polygons tend to be quite compact. Every two districts differ in population by at most one (so we call the solution balanced). In fact, the solution is a centroidal power diagram: each polygon has an associated center in R3 such that
1. the projection of the center onto the plane z = 0 is the centroid of the locations of people assigned to the polygon, and
2. for each person assigned to that polygon, the polygon center is closest among all centers. The polygons are convex because they are the intersections of 3D Voronoi cells with the plane.
The solution is, in a well-defined sense, a locally optimal solution to the problem of choosing centers in the plane and choosing an assignment of people to those 2-d centers so as to minimize the sum of squared distances subject to the assignment being balanced.

A practical problem with this approach is that, in real-world redistricting, exact locations of people are unknown. Instead, the input consists of polygons (census blocks) and associated populations. A real redistricting must not split census blocks. We therefore propose a second phase that perturbs the solution slightly so it does not split census blocks. In our experiments, the second phase achieves this while preserving perfect population balance. The district polygons are no longer convex at the fine scale because their boundaries must follow the boundaries of census blocks, but at a coarse scale they preserve the shape of the original polygons.
- more...
- full text
- bibtex
- google scholar
- doi
- link
IEEE Conference on Computer Vision and Pattern Recognition
With Sourya Roy, Sujoy Paul and Amit K Roy-Chowdhury
Minimization of labeling effort for person re-identification in camera networks is an important problem as most of the existing popular methods are supervised and they require large amount of manual annotations, acquiring which is a tedious job. In this work, we focus on this labeling effort minimization problem and approach it as a subset selection task where the objective is to select an optimal subset of image-pairs for labeling without compromising performance. Towards this goal, our proposed scheme first represents any camera network (with k number of cameras) as an edge weighted complete k-partite graph where each vertex denotes a person and similarity scores between persons are used as edge-weights. Then in the second stage, our algorithm selects an optimal subset of pairs by solving a triangle free subgraph maximization problem on the k-partite graph. This sub-graph weight maximization problem is NP-hard (at least for k larger than 3) which means for large datasets the optimization problem becomes intractable. In order to make our framework scalable, we propose two polynomial time approximately-optimal algorithms. The first algorithm is a 1/2-approximation algorithm which runs in linear time in the number of edges. The second algorithm is a greedy algorithm with sub-quadratic (in number of edges) time-complexity. Experiments on three state-of-the-art datasets depict that the proposed approach requires on an average only 8-15% manually labeled pairs in order to achieve the performance when all the pairs are manually annotated.
- more...
- full text
- bibtex
- google scholar
- doi
- link
CRC Handbook of Approximation Algorithms and Metaheuristics (Chapter 4)(2018)
With Samir Khuller and Balaji Raghavachari
An introduction to greedy approximation algorithms for combinatorial optimization problems. Informal definition of greedy algorithms. Greedy Set Cover, and other generalizations and applications. Greedy algorithms for other problems. Lagrangean relaxation based methods. Local ratio, greedy dual based schemes and greedy rounding. Connected Dominating Sets and other applications. Priority Based Approaches.
- more...
- full text
- bibtex
- google scholar
- link
IEEE International Conference on Data Engineering
With Nhat X.T. Le and Vagelis Hristidis
Poster version of [2019].
Poster.
- more...
- full text
- bibtex
- google scholar
- doi
- link
Data Mining and Knowledge Discovery 30(1):243-281(2016)
With Jesin Zakaria, Abdullah Mueen and Eamonn Keogh
Over the past decade, time series clustering has become an increasingly important research topic in data mining community. Most existing methods for time series clustering rely on distances calculated from the entire raw data using the Euclidean distance or Dynamic Time Warping distance as the distance measure. However, the presence of significant noise, dropouts, or extraneous data can greatly limit the accuracy of clustering in this domain. Moreover, for most real world problems, we cannot expect objects from the same class to be equal in length. As a consequence, most work on time series clustering only considers the clustering of individual time series ``behaviors,'' e.g., individual heart beats or individual gait cycles, and contrives the time series in some way to make them all equal in length. However, automatically formatting the data in such a way is often a harder problem than the clustering itself. In this work, we show that by using only some local patterns and deliberately ignoring the rest of the data, we can mitigate the above problems and cluster time series of different lengths, e.g., cluster one heartbeat with multiple heartbeats. To achieve this, we exploit and extend a recently introduced concept in time series data mining called shapelets. Unlike existing work, our work demonstrates the unintuitive fact that shapelets can be learned from unlabeled time series. We show, with extensive empirical evaluation in diverse domains, that our method is more accurate than existing methods. Moreover, in addition to accurate clustering results, we show that our work also has the potential to give insight into the domains to which it is applied. While a brute-force algorithm to discover shapelets in an unsupervised way could be untenably slow, we introduce two novel optimization procedures to significantly speed up the unsupervised-shapelet discovery process and allow it to be cast as an anytime algorithm.
- more...
- full text
- bibtex
- google scholar
- doi
- link
Springer Encyclopedia of Algorithms:886-889(2016)
A brief summary of greedy set-cover algorithms.

INDEX TERMS: dominating set, greedy algorithm, hitting set, set cover, minimizing a linear function subject to a submodular constraint.
- more...
- full text
- bibtex
- google scholar
- doi
- link
working paper(2016)
We describe the first nearly linear-time approximation algorithms for explicitly given mixed packing/covering linear programs, and for (non-metric) fractional facility location. We also describe the first parallel algorithms requiring only near-linear total work and finishing in polylog time. The algorithms compute $(1+\epsilon)$-approximate solutions in time (and work) $\tilde O(N/\epsilon^2)$, where $N$ is the number of non-zeros in the constraint matrix. For facility location, $N$ is the number of eligible client/facility pairs.
- more...
- full text
- bibtex
- google scholar
- link
Springer Encyclopedia of Algorithms:1457-1461(2016)
A brief summary of online paging and caching results, 1985-2013.

INDEX TERMS: paging, caching, weighted caching, weighted paging, file caching, least recently used (paging algorithm), first in first out (paging algorithm), flush when full (paging algorithm), the Marking algorithm (paging algorithm), Balance algorithm (weighted caching algorithm), Greedy Dual (weighted caching algorithm), Landlord (file caching algorithm), Squid (file caching software), k-server problem, primal-dual algorithms, randomized algorithms, online algorithms, competitive analysis, competitive ratio, loose competitiveness, access-graph model, Markov paging.
- more...
- full text
- bibtex
- google scholar
- doi
- link
Journal of Scheduling 18(6):545-560(2015); (ICALP '13);
With Marcin Bienkowski, Jaroslaw Byrka, Marek Chrobak, Neil Dobbs, Thomas Nowicki, Maxim Sviridenko and Grzegorz Swirszcz
The Joint Replenishment Problem (JRP) is a fundamental optimization problem in supply-chain management, concerned with optimizing the flow of goods over time from a supplier to retailers. Over time, in response to demands at the retailers, the supplier sends shipments, via a warehouse, to the retailers. The objective is to schedule shipments to minimize the sum of shipping costs and retailers' waiting costs. We study the approximability of JRP with deadlines, where instead of waiting costs the retailers impose strict deadlines. We study the integrality gap of the standard linear-program (LP) relaxation, giving a lower bound of 1.207, and an upper bound and approximation ratio of 1.574. The best previous upper bound and approximation ratio was 1.667; no lower bound was previously published. For the special case when all demand periods are of equal length we give an upper bound of 1.5, a lower bound of 1.2, and show APX-hardness.
Journal version of [2013].
- more...
- full text
- bibtex
- google scholar
- doi
- link
SIAM Journal on Computing 44(4):1154-1172(2015); (IPCO'99)
With Philip Klein
This paper gives a lower bound on the complexity of $(1 \pm \epsilon)$-approximately solving a packing or covering problem using a certain class of Lagrangian-relaxation algorithms: any such algorithm, given a problem formed by a random {0,1}-matrix, requires (with high probability) a number of iterations proportional to $(\rho \epsilon)^2\log m$. (Here $\rho$ is a technical parameter, the``width'' of the problem instance.)

The class of algorithms in question includes Dantzig-Wolfe decomposition, Benders decomposition, the Lagrangian-relaxation method developed by Held and Karp [1971] for lower-bounding TSP, and algorithms recently studied by many authors (including Serge Plotkin, David Shmoys, and Eva Tardos [1988]; Mike Grigoriadis and L.G. Khachiyan [1996]; Michael Luby and Noam Nisan [1993]; Naveen Garg and Jochen Konemann [1998]; and Lisa Fleischer [1999]). The lower bound matches the known upper bounds within a constant factor. The lower bound is useful because, in practice, the dependence on $\epsilon$ limits the applicability of these algorithms. The lower bound provides some insight into what is necessary to surmount this dependence.
Journal version of [1999]. Published online Aug. 2015.
- more...
- full text
- bibtex
- google scholar
- doi
- link
International Symposium on Algorithms and Computation (LNCS 9472)
With Marek Chrobak, Mordecai Golin and J. Ian Munro
In 1971, Knuth gave an $O(n^2)$-time algorithm for the classic problem of finding an optimal binary search tree. Knuth's algorithm works only for search trees based on 3-way comparisons, while most modern computers support only 2-way comparisons (e.g., $<$, $\le$, $=$, $\ge$, and $>$). Until this paper, the problem of finding an optimal search tree using 2-way comparisons remained open --- poly-time algorithms were known only for restricted variants. We solve the general case, giving (i) an $O(n^4)$-time algorithm and (ii) an $O(n\log n)$-time additive-3 approximation algorithm.

For finding optimal binary split trees, we (iii) obtain a linear speedup and (iv) prove some previous work incorrect.
Conference version of [Chrobak22Simple, Chrobak21Cost, Chrobak22Huang]
- more...
- full text
- bibtex
- google scholar
- doi
- link
Handbook of Graph Theory, Combinatorial Optimization, and Algorithms(2015)
A brief summary of greedy algorithms for set cover and vertex cover.

INDEX TERMS: greedy algorithm, set cover, vertex cover.
- more...
- bibtex
- google scholar
- link
Algorithmica 70(4):648-674(2014); FOCS'07
With Christos Koufogiannakis^*
We give an approximation algorithm for packing and covering linear programs (linear programs with non-negative coefficients). Given a constraint matrix with $N$ non-zeros, $r$ rows, and $c$ columns, the algorithm (with high probability) computes feasible primal and dual solutions whose costs are within a factor of $1+\epsilon$ of the optimal cost in time $O((r+c)\log(N)/\epsilon^2 + N)$.
Journal version of [2007].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
ACM-SIAM Symposium on Discrete Algorithms
With Monik Khare^* and Claire Mathieu
Can one choose a good Huffman code without knowing the underlying distribution? The following Online Slot Allocation (OSA) problem models this and similar problems. Given $n$ nonnegative slot costs $c(1), c(2), ..., c(n)$, requests for items are drawn from a hidden probability distribution $p$ on $[n]$, and revealed online one by one. In response to each request, if the item has not been previously requested, the algorithm (knowing only $c$ and the items requested so far) must allocate some unused slot for the item. The objective is to minimize the cost of the slot allocation, which is the sum, over the items, of the probability of the item times the cost of its assigned slot.

The online First-Come-First-Served algorithm (FCFS) simply allocates the cheapest slot available when an item needing a slot arrives. FCFS is optimal among online algorithms. For general costs, we show that its competitive ratio is $1+\ln n$. For concave costs, its competitive ratio is 2. For logarithmic costs, the ratio is, asymptotically, 1: the algorithm's expected cost is $\mbox{opt} + O(\log \mbox{opt})$.

A practical application of OSA is Online Huffman Coding (OHC), a variant of Huffman coding in which the codewords must be allocated on demand, without knowing the underlying probability distribution. OHC reduces to OSA, yielding a first-come-first-served algorithm for OHC that guarantees near-optimal expected cost, at most $\mbox{opt} + 2 \log(1+\mbox{opt}) + 2$.
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
SIAM Journal on Computing 43(1):25-51(2014); SODA'12
With Arman Yousefi^*
Minimum-weight triangulation (MWT) is NP-hard. It has a polynomial-time constant-factor approximation algorithm, and a variety of effective polynomial- time heuristics that, for many instances, can find the exact MWT. Linear programs (LPs) for MWT are well-studied, but previously no connection was known between any LP and any approximation algorithm or heuristic for MWT. Here we show the first such connections: for an LP formulation due to Dantzig et al. (1985): (i) the integrality gap is bounded by a constant; (ii) given any instance, if the aforementioned heuristics find the MWT, then so does the LP. Our analysis improves the best approximation ratio known for MWT (which is an astronomically large constant) by an order of magnitude.
Journal version of [2012].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
Automata, Languages, and Programming (ICALP '13); Lecture Notes in Computer Science Volume 7965
With Marcin Bienkowski, Jaroslaw Byrka, Marek Chrobak, Neil Dobbs, Thomas Nowicki, Maxim Sviridenko and Grzegorz Swirszcz
Conference version of [2014].
- more...
- full text
- bibtex
- google scholar
- doi
- link
Algorithmica 66(1):113-152(2013); ICALP'09
With Christos Koufogiannakis^*
This paper describes a simple greedy Δ-approximation algorithm for any covering problem whose objective function is submodular and non-decreasing, and whose feasible region can be expressed as the intersection of arbitrary (closed upwards) covering constraints, each of which constrains at most Δ variables. (A simple example is Vertex Cover, with Δ = 2.) The algorithm generalizes previous approximation algorithms for fundamental covering problems and online paging and caching problems.

(For distributed implementations of these algorithms, see Distributed algorithms for covering, packing and maximum weighted matching .)
Journal version of [2009].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
Theory of Computing 9(22):685-702(2013)
With Daniel Sheldon
Given a satisfiable 3-SAT formula, how hard is it to find an assignment to the variables that has Hamming distance at most $n/2$ to a satisfying assignment? More generally, consider any polynomial-time verifier for any NP-complete language. A $d(n)$-Hamming-approximation algorithm for the verifier is one that, given any member $x$ of the language, outputs in polynomial time a string $a$ with Hamming distance at most $d(n)$ to some witness $w$, where $(x,w)$ is accepted by the verifier. Previous results have shown that, if P$\ne$NP, then every NP-complete language has a verifier for which there is no $(n/2-n^{2/3+\delta})$-Hamming-approximation algorithm, for various constants $\delta\ge 0$.

Our main result is that, if P$\ne$NP, then every paddable NP-complete language has a verifier that admits no $(n/2+O(\sqrt{n\log n}))$-Hamming-approximation algorithm. That is, one cannot get even half the bits right. We also consider natural verifiers for various well-known NP-complete problems. They do have $n/2$-Hamming-approximation algorithms, but, if P$\ne$NP, have no $(n/2-n^\epsilon)$-Hamming-approximation algorithms for any constant $\epsilon>0$.

We show similar results for randomized algorithms.
First draft circulated in 2003.
- more...
- full text
- bibtex
- google scholar
- doi
- link
SIAM Journal on Computing 41(3):684-713(2012); STOC'02
With Mordecai Golin and Claire Mathieu
We study the generalization of Huffman Coding in which codeword letters have non-uniform costs (as in Morse code, where the dash is twice as long as the dot). Despite previous work by many authors including Richard Karp [1961] and Kurt Mehlhorn [1980], the problem is not known to be NP-hard, nor was it previously known to have a constant-factor approximation algorithm. The paper describes a polynomial-time approximation scheme (PTAS) for the problem. The algorithm computes a $(1+\epsilon)$-approximate solution in time $O(n + f(\epsilon) \log^3 n)$, where $n$ is the input size.
Journal version of [2002].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
ACM-SIAM Symposium on Discrete Algorithms
With Arman Yousefi^*
Conference version of [2013].
- more...
- full text
- bibtex
- google scholar
- link
Distributed Computing 24(1):45-63(2011); PODC'09, DISC'09
With Christos Koufogiannakis^*
This paper gives poly-logarithmic-round, distributed δ-approximation algorithms for covering problems with submodular cost and monotone covering constraints (Submodular-cost Covering). The approximation ratio δ is the maximum number of variables in any constraint. Special cases include Covering Mixed Integer Linear Programs (CMIP), and Weighted Vertex Cover (with δ = 2).

Via duality, the paper also gives poly-logarithmic-round, distributed δ-approximation algorithms for Fractional Packing linear programs (where δ is the maximum number of constraints in which any variable occurs), and for Max Weighted c-Matching in hypergraphs (where δ is the maximum size of any of the hyperedges; for graphs δ = 2).

The paper also gives parallel (RNC) 2-approximation algorithms for CMIP with two variables per constraint and Weighted Vertex Cover.

The algorithms are randomized. All of the approximation ratios exactly match those of comparable centralized algorithms.

(This paper gives distributed implementations of algorithms from Greedy δ-approximation algorithm for covering with arbitrary constraints and submodular cost .)
Journal version of two conference papers, on [covering] and [packing].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
ACM International Conference on Knowledge Discovery and Data Mining
With Abdullah Mueen and Eamonn Keogh
Time series shapelets are small, local patterns in a time series that are highly predictive of a class and are thus very useful features for building classifiers and for certain visualization and summarization tasks. While shapelets were introduced only recently, they have already seen significant adoption and extension in the community.

Despite their immense potential as a data mining primitive, there are two important limitations of shapelets. First, their expressiveness is limited to simple binary presence/absence questions. Second, even though shapelets are computed offline, the time taken to compute them is significant.

In this work, we address the latter problem by introducing a novel algorithm that finds shapelets in less time than current methods by an order of magnitude. Our algorithm is based on intelligent caching and reuse of computations, and the admissible pruning of the search space. Because our algorithm is so fast, it creates an opportunity to consider more expressive shapelet queries. In particular, we show for the first time an augmented shapelet representation that distinguishes the data based on conjunctions or disjunctions of shapelets. We call our novel representation Logical-Shapelets. We demonstrate the efficiency of our approach on the classic benchmark datasets used for these problems, and show several case studies where logical shapelets significantly outperform the original shapelet representation and other time series classification techniques. We demonstrate the utility of our ideas in domains as diverse as gesture recognition, robotics, and biometrics.
- more...
- full text
- bibtex
- google scholar
- doi
- link
Algorithms and Theory of Computation Handbook, Second Edition, Volume 1: General Concepts and Techniques(2010)
With Philip Klein
A 40-page sketch of basic techniques in the design and analysis of combinatorial approximation algorithms. Sections: Introduction, Underlying principles, Approximation algorithms with small additive error, Performance guarantees, Randomized rounding and linear programming, Performance ratios and c-approximation, Polynomial approximation schemes, Constant-factor performance guarantees, Logarithmic performance guarantees, Multi-criteria problems, Hard-to-approximate problems, Research Issues and Summary, Defining Terms, Further Information.
- more...
- full text
- bibtex
- google scholar
- link
ACM Symposium on Principles of Distributed Computing
With Christos Koufogiannakis^*
The paper presents distributed and parallel δ-approximation algorithms for covering problems, where δ is the maximum number of variables on which any constraint depends (for example, δ=2 for vertex cover).

Specific results include the following.
- For weighted vertex cover, the first distributed 2-approximation algorithm taking O(log n) rounds and the first parallel 2-approximation algorithm in RNC. The algorithms generalize to covering mixed integer linear programs (CMIP) with two variables per constraint (δ=2).
- For any covering problem with monotone constraints and submodular cost, a distributed $\delta$-approximation algorithm taking O(log² |C|) rounds, where |C| is the number of constraints. (Special cases include CMIP, facility location, and probabilistic (two-stage) variants of these problems.)
(This paper gives distributed implementations of algorithms from Greedy δ-approximation algorithm for covering with arbitrary constraints and submodular cost .)
Conference version of part of Distributed algorithms for covering, packing and maximum weighted matching.
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
International Symposium on Distributed Computing
With Christos Koufogiannakis^*
We present efficient distributed δ-approximation algorithms for fractional packing and maximum weighted b-matching in hypergraphs, where δ is the maximum number of packing constraints in which a variable appears (for maximum weighted b-matching δ is the maximum edge degree - for graphs δ = 2). (a) For δ = 2 the algorithm runs in O(log m) rounds in expectation and with high probability. (b) For general δ, the algorithm runs in O(log² m) rounds in expectation and with high probability.

(This paper gives distributed algorithms for the duals of problems considered in Greedy δ-approximation algorithm for covering with arbitrary constraints and submodular cost .)
Conference version of part of Distributed Algorithms for Covering, Packing and Maximum Weighted Matching.
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
International Colloquium on Automata, Languages, and Programming (LNCS 5555)
With Christos Koufogiannakis^*
Conference version of [2013].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
IEEE Transactions on Mobile Computing 8(5):590-605(2009); SECON'06
With Ece Gelal, Gentian Jakllari and Srikanth Krishnamurthy
With fully directional communications, nodes must track and periodically update the positions of their discovered neighbors, so that communication with these neighbors is feasible when needed. If tracking fails, neighbors that move out of the directional footprint will need to be rediscovered. The tracking process introduces an overhead, which increases with the number of discovered neighbors. The overhead can be reduced if each node maintains only a subset of its neighbors; however, this may increase the hop count of paths between nodes, i.e., causes a hop stretch. In this work, we study the tradeoffs between node degree and hop stretch. We first design a topology control algorithm to optimize this tradeoff. For the purposes of this design, we assume that nodes perform circular directional transmissions to communicate with the nodes in their directional range; in this way, the network can be modeled as a unit disk graph (UDG). Given a UDG G, our algorithm finds a sparse subgraph G' with a maximum node degree of 6, and connecting each node pair u, v by a path of length hops_G'(u,v) = O(hops_G(u,v) + log \delta), where \delta is the maximum degree in G, and hops_G(u,v) denotes the length of the shortest path between u and v in graph G. We show that this result is near optimal. Based on the insights gained from the above design, we next construct a more practical scheme, which integrates topology control with fully directional neighbor discovery and maintenance. The simulated performance of our practical scheme is only slightly worse than the theoretical optimal performance in terms of node degree and path stretch.
Journal version of [2006].
- more...
- full text
- bibtex
- google scholar
- doi
- link
Springer Encyclopedia of Algorithms(2008)
A brief summary of greedy set-cover algorithms.

INDEX TERMS: dominating set, greedy algorithm, hitting set, set cover, minimizing a linear function subject to a submodular constraint.
- more...
- full text
- bibtex
- google scholar
- doi
- link
Algorithmica 50(4):455-478(2008); Latin American Theoretical Informatics (LATIN'06; LNCS 3887:311-322)
With Marek Chrobak, Claire Mathieu and John Noga
Following Mettu and Plaxton, we study incremental algorithms for the $k$-medians problem. Such an algorithm must produce a nested sequence $F_1 \subseteq F_2 \subseteq \cdots \subseteq F_n$ of sets of facilities. Mettu and Plaxton show that incremental metric medians has a (roughly) $40$-competitive deterministic polynomial-time algorithm. We give improved algorithms, including a $(24+\epsilon)$-competitive deterministic polynomial-time algorithm and a $5.44$-competitive, randomized, non-polynomial-time algorithm.

We also consider the competitive ratio with respect to size. An algorithm is $s$-size-competitive if, for each $k$, the cost of $F_k$ is at most the minimum cost of any set of $k$ facilities, while the size of $F_k$ is at most $s k$. We present optimally competitive algorithms for this problem.

Our proofs reduce incremental medians to the following online bidding problem: faced with some unknown threshold $T>0$, an algorithm must submit ``bids'' $b>0$ until it submits a bid as large as $T$. The algorithm pays the sum of its bids. We describe optimally competitive algorithms for online bidding.

Our results on cost-competitive incremental medians extend to approximately metric distance functions, incremental fractional medians, and incremental bicriteria approximation.
Journal version of [2006].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
Springer Encyclopedia of Algorithms(2008)
A brief summary of online paging and caching results, 1985-2002.

INDEX TERMS: paging, caching, weighted caching, weighted paging, file caching, least recently used (paging algorithm), first in first out (paging algorithm), flush when full (paging algorithm), the Marking algorithm (paging algorithm), Balance algorithm (weighted caching algorithm), Greedy Dual (weighted caching algorithm), Landlord (file caching algorithm), Squid (file caching software), k-server problem, primal-dual algorithms, randomized algorithms, online algorithms, competitive analysis, competitive ratio, loose competitiveness, access-graph model, Markov paging.
- more...
- full text
- bibtex
- google scholar
- doi
- link
Journal of Bioinformatics and Computational Biology 5(4):937-961(2007); Asia-Pacific Bioinformatics Conference
With Qi Fu, Elizabeth Bent, James Borneman and Marek Chrobak
We study the problem of selecting control clones in DNA array hybridization experiments. The problem arises in the OFRG method for analyzing microbial communities. The OFRG method performs classification of rRNA gene clones using binary fingerprints created from a series of hybridization experiments, where each experiment consists of hybridizing a collection of arrayed clones with a single oligonucleotide probe. This experiment produces analog signals, one for each clone, which then need to be classified, that is, converted into binary values 1 and 0 that represent hybridization and non-hybridization events. In addition to the sample rRNA gene clones, the array contains a number of control clones needed to calibrate the classification procedure of the hybridization signals. These control clones must be selected with care to optimize the classification process. We formulate this as a combinatorial optimization problem called Balanced Covering. We prove that the problem is NP-hard, and we show some results on hardness of approximation. We propose approximation algorithms based on randomized rounding, and we show that, with high probability, our algorithms approximate well the optimum solution. The experimental results confirm that the algorithms find high quality control clones. The algorithms have been implemented and are publicly available as part of the software package called CloneTools.
Journal version of [2007].
- more...
- full text
- bibtex
- google scholar
- doi
- link
Asia-Pacific Bioinformatics Conference
With Qi Fu, Elizabeth Bent, James Borneman and Marek Chrobak
Conference version of [2007].
- more...
- full text
- bibtex
- google scholar
- link
IEEE Symposium on Foundations of Computer Science
With Christos Koufogiannakis^*
Conference version of [2013].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
ACM International Conference on Knowledge Discovery and Data Mining
With Deepak Agarwal, Dhiman Barman, Dimitrios Gunopulos, Flip Korn and Divesh Srivastava
Given a rooted tree whose leaves have numeric labels, label the nodes so that each leaf label equals the sum of the node labels from the root to the leaf. Minimize the number of non-zero node labels. This problem models finding a parsimonius explanation of observed changes in hierarchical data. We give fast algorithms with applications.
- more...
- full text
- bibtex
- google scholar
- doi
- link
CRC Handbook of Approximation Algorithms and Metaheuristics (Chapter 4)(2007)
With Samir Khuller and Balaji Raghavachari
An introduction to greedy approximation algorithms for combinatorial optimization problems. Informal definition of greedy algorithms. Greedy Set Cover, and other generalizations and applications. Greedy algorithms for other problems. Lagrangean relaxation based methods. Local ratio, greedy dual based schemes and greedy rounding. Connected Dominating Sets and other applications. Priority Based Approaches.
- more...
- full text
- bibtex
- google scholar
- doi
- link
IEEE International Conference on Database Engineering
With Dhiman Barman, Flip Korn, Divesh Srivastava, Dimitrios Gunopulos and Deepak Agarwal
Poster presentation of [2007].
- more...
- full text
- bibtex
- google scholar
- link
MASS: IEEE International Conference on Mobile Ad-hoc and Sensor Systems
With Ece Gelal, Gentian Jakllari and Srikanth Krishnamurthy
This paper is a follow-up to Topology control to simultaneously achieve near-optimal node degree and low path stretch in ad hoc networks. It gives a related heuristic with more extensive empirical evaluation.
- more...
- full text
- bibtex
- google scholar
- doi
- link
Latin American Theoretical Informatics (LATIN; LNCS 3887)
With Marek Chrobak, Claire Mathieu and John Noga
Conference version of [2008].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
Information Processing Letters 97:68-72(2006); COCOON'05
With Marek Chrobak and Claire Mathieu
The Reverse Greedy algorithm (RGreedy) for the $k$-median problem works as follows. It starts by placing facilities on all nodes. At each step, it removes a facility to minimize the resulting total distance from the customers to the remaining facilities. It stops when $k$ facilities remain. We prove that, if the distance function is metric, then the approximation ratio of RGreedy is between $\Omega(\log n/ \log \log n)$ and $O(\log n)$.
Journal version of [2005].
- more...
- full text
- bibtex
- google scholar
- doi
- link
IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks
With Ece Gelal, Gentian Jakllari and Srikanth Krishnamurthy
Conference version of [2009].
- more...
- full text
- bibtex
- google scholar
- doi
- link
Journal of Computer and System Sciences 71(4):495-505(2005); FOCS'01
With Stavros Kolliopoulos
Given matrices A and B and vectors a, b, c and d, all with non-negative entries, we consider the problem of minimizing c$\cdot$x subject to x$\in Z^n_+$, Ax ≥ a, Bx ≤ b, and x ≤ d. We give a bicriteria-approximation algorithm that, given \epsilon$\in$(0, 1], finds a solution of cost O(log(m)/\epsilon²) times optimal, meeting the covering constraints Ax≥a and multiplicity constraints (x≤d), and satisfying Bx≤(1 + \epsilon)b + β, where β is the vector of row sums, β_i=$\sum_j$ B_ij. Here m is the number of rows of A.

This gives an O(log m)-approximation algorithm for CIP --- minimum-cost covering integer programs with multiplicity constraints, i.e., the special case when there are no packing constraints Bx ≤ b. The previous best approximation ratio has been O(log max_j $\sum_i$A_ij since 1982. CIP contains the set cover problem as a special case, so O(log m)-approximation is the best possible unless P=NP.
Journal version of Tight approximation results for general covering integer programs.
- more...
- full text
- bibtex
- google scholar
- doi
- link
International Computing and Combinatorics Conference
With Marek Chrobak and Claire Mathieu
Conference version of [2006].
- more...
- full text
- bibtex
- google scholar
- link
Mathematics of Operations Research 29(3):0436-0461(2004); STOC'99
With David Karger, Philip Klein, Cliff Stein and Mikkel Thorup
The multiway-cut problem is, given a weighted graph and $k > 2$ terminal nodes, to find a minimum-weight set of edges whose removal separates all the terminals. The problem is NP-hard, and even NP-hard to approximate within $1+\delta$ for some small $\delta > 0$.

This paper gives a 12/11-approximation algorithm for $k=3$ and a 1.3438-approximation algorithm in general. Working on this result was particularly fun because the problem of designing the algorithm was itself formulated as an optimization problem and approximately solved by computer. These computations guided the final (non-computational) solutions.

The paper also gives a proof that for a general class of ``geometric embedding'' relaxations, there are always randomized rounding schemes that match the integrality gap. (Another example of such a geometric-embedding relaxation is the semidefinite-programming relaxation of max-cut by Goemans and Williamson [1995].)

The geometric relaxation is due to Calinescu, Karloff, and Rabani [1998]. After our paper, the approximation ratio for general $k$ was subsequently improved to 1.3238 in [2013].

Journal version of [1999].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
The Astronomical Journal 125:2276-2286(2003)
With Michael R. Blanton, Huan Lin, Robert Lupton and Miller Maley
This paper describes a modification to the heuristic algorithm presented in ``Data collection for the Sloan Digital Sky Survey: a network-flow heuristic.'' The modification takes into account that galaxies that are too close together cannot be captured in the same snapshot. This was not considered in the original algorithm.
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM Symposium on Theory of Computing
With Mordecai Golin and Claire Mathieu
Conference version of [2012].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
Algorithmica 33(3):371-383(2002); SODA'98
This paper introduced a new file-caching algorithm, called Landlord or Greedy-Dual-size, for use by web proxies and browsers. The paper gave theoretical analyses of the algorithm suggesting that it would perform well in practice.

A modified version of the algorithm is incorporated into the public-domain Squid web proxy.

The algorithm was independently obtained by Pei Cao and Sandy Irani [1997]. Both they and John Dilley et al [1999] found that the algorithm worked well empirically.

The results in this paper strengthen and generalize those in [1994] and are further generalized in [2009].
Journal version of [1998].
- more...
- full text
- bibtex
- google scholar
- doi
- link
IEEE Symposium on Foundations of Computer Science
With Naveen Garg
Congestion control in the current Internet is accomplished mainly by TCP/IP. To understand the macroscopic network behavior that results from TCP/IP and similar end-to-end protocols, one main analytic technique is to show that the the protocol maximizes some global objective function of the network traffic.

Here we analyze a particular end-to-end, MIMD (multiplicative-increase, multiplicative-decrease) protocol. We show that if all users of the network use the protocol, and all connections last for at least logarithmically many rounds, then the total weighted throughput (value of all packets received) is near the maximum possible. Our analysis includes round-trip-times, and (in contrast to most previous analyses) gives explicit convergence rates, allows connections to start and stop, and allows capacities to change.

An application explained in the paper is multi-path bandwidth measurement.
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
IEEE Symposium on Foundations of Computer Science
Mixed packing and covering problems are problems that can be formulated as linear programs using only non-negative coefficients. Examples include multicommodity network flow, the Held-Karp lower bound on TSP, fractional relaxations of set cover, bin-packing, knapsack, scheduling problems, minimum-weight triangulation, etc. This paper gives approximation algorithms for the general class of problems. The sequential algorithm can be implemented to find an $(1\pm\epsilon)$-approximate solution in $O(\epsilon^{-2}\log m)$ linear-time iterations.

For $\epsilon = O(1)$, these algorithms are currently the fastest known for the general problem. The results generalize previous work on pure packing and covering (the special case when the constraints are all ``less-than'' or all ``greater-than'') by Michael Luby and Noam Nisan [1993]; and Naveen Garg and Jochen Konemann [1998]; and Lisa Fleischer [1999]
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
IEEE Symposium on Foundations of Computer Science
With Stavros Kolliopoulos
Conference version of Approximation algorithms for covering/packing integer programs.
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM-SIAM Symposium on Discrete Algorithms
The paper gives approximation algorithms for the $k$-medians and facility-location problems (both NP-hard). For $k$-medians, the algorithm returns a solution using at most $\ln(n+n/\epsilon) k$ medians and having cost at most $(1+\epsilon)$ times the cost of the best solution that uses at most $k$ medians. Here $\epsilon>0$ is an input to the algorithm. In comparison, the best previous algorithm [Lin and Jeff Vitter, 1992] had a $(1+1/\epsilon)\ln n$ term instead of the $\ln(n+n/\epsilon)$ term in the performance guarantee. For facility location, the algorithm returns a solution of cost at most $d+\ln(n) k$, provided there exists a solution of cost $d+k$ where $d$ is the assignment cost and $k$ is the facility cost. In comparison, the best previous algorithm [Dorit Hochbaum, 1982] returned a solution of cost at most $\ln(n)(d+k)$. For both problems, the algorithms currently provide the best performance guarantee known for the general (non-metric) problems.

The paper also introduces a new probabilistic bound (the so-called Chernoff-Wald bound) for bounding the expectation of the maximum of a collection of sums of random variables, when each sum contains a random number of terms. The bound is used to analyze the randomized rounding scheme that underlies the algorithms.
- more...
- full text
- slides
- bibtex
- google scholar
- link
Journal of Algorithms 37(1):218-235(2000); SODA'98
Elias Koutsoupias and Christos Papadimitriou introduced the so-called diffuse-adversary model for analyzing paging strategies such as least-recently-used (LRU) [1994]. Briefly, the diffuse adversary model is as follows. Fix some $\epsilon >0$ Run the algorithm, and, for each request, allow the adversary to assign a probability distribution to the items such that no item gets probability more than $\epsilon$, then request an item randomly from the distribution. (When $\epsilon$ is near 1, one has the standard competitive-analysis model; when $\epsilon$ is near 0, the adversary is severely handicapped.) The algorithm is $c$-competitive if the expected cost of the algorithm on the adversarial sequence is at most $c$ times the expected cost of the optimal algorithm.

Koutsoupias and Papadimitriou proved that LRU was optimally competitive (among deterministic on-line algorithms) according to the model, but left open the question of what the performance guarantee for LRU actually was. This paper answers the question, giving an explicit formula for the performance guarantee within a factor of two. The paper also shows that the performance guarantee of first-in-first-out (FIFO) is much worse than that of LRU, and analyzes the performance guarantees of randomized on-line paging strategies.
Journal version of Bounding the diffuse adversary.
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM Symposium on Theory of Computing
With Claire Mathieu and Nicolas Schabanel
The data broadcast problem is to find a schedule for broadcasting a given set of messages over multiple channels. The goal is to minimize the cost of the broadcast plus the expected response time to clients who periodically and probabilistically tune in to wait for particular messages.

The problem models disseminating data to clients in asymmetric communication environments, where there is a much larger capacity from the information source to the clients than in the reverse direction. Examples include satellites, cable TV, Internet broadcast, and mobile phones. Such environments favor the ``push-based'' model where the server broadcasts (pushes) its information on the communication medium and multiple clients simultaneously retrieve the specific information of individual interest.

This paper presents the first polynomial-time approximation scheme (PTAS) for data broadcast with $O(1)$ channels and when each message has arbitrary probability, unit length and bounded cost. The best previous polynomial-time approximation algorithm for this case has a performance ratio of 9/8 [Bar-Noy et al, 2000].
- more...
- full text
- bibtex
- google scholar
- doi
- link
Algorithms and Theory of Computation Handbook, First Edition(1999)
With Philip Klein
A 40-page sketch of basic techniques in the design and analysis of combinatorial approximation algorithms. Sections: Introduction, Underlying principles, Approximation algorithms with small additive error, Performance guarantees, Randomized rounding and linear programming, Performance ratios and c-approximation, Polynomial approximation schemes, Constant-factor performance guarantees, Logarithmic performance guarantees, Multi-criteria problems, Hard-to-approximate problems, Research Issues and Summary, Defining Terms, Further Information.
- more...
- full text
- bibtex
- google scholar
- link
ACM-SIAM Symposium on Discrete Algorithms
With Jay Aslam, April Rasala and Cliff Stein
Two common objectives for evaluating a schedule are the makespan, or schedule length, and the average completion time. This short note gives improved bounds on the existence of schedules that simultaneously optimize both criteria. In particular, for any ρ > 0, there exists a schedule of makespan at most 1+ρ times the minimum, with average completion time at most 1/(1-e^-ρ) times the minimum. The proof uses an infinite-dimensional linear program to generalize and strengthen a previous analysis by Cliff Stein and Joel Wein [1997].
- more...
- full text
- bibtex
- google scholar
- link
Integer Programming and Combinatorial Optimization (LNCS 1610)
With Philip Klein
Conference version of [2015].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
ACM Symposium on Theory of Computing
With David Karger, Philip Klein, Cliff Stein and Mikkel Thorup
Conference version of [2004].
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
ACM-SIAM Symposium on Discrete Algorithms
Conference version of On-line paging against adversarially biased random inputs.
- more...
- full text
- bibtex
- google scholar
- link
Journal of Algorithms 27(2):339-356(1998); SODA'96
With Robert Lupton and Miller Maley
The goal of the Sloan Digital Sky Survey is ``to map in detail one-quarter of the entire sky, determining the positions and absolute brightnesses of more than 100 million celestial objects''. The survey will be performed by taking ``snapshots'' through a large telescope. Each snapshot can capture up to 600 objects from a small circle of the sky (as illustrated here). This paper describes the design and implementation of the algorithm that is being used to determine the snapshots so as to minimize their number. The problem is NP-hard in general; the algorithm described is a heuristic, based on Lagrangian-relaxation and min-cost network flow. It gets within 5-15% of a naive lower bound, whereas using a ``uniform'' cover only gets within 25-35%.
Journal version of [1996].
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM-SIAM Symposium on Discrete Algorithms
Conference version of [2002].
- more...
- full text
- bibtex
- google scholar
- link
IEEE Data Compression Conference
With Qin Zhang and John Danskin
Pattern-matching-based document-compression systems (e.g. for faxing) rely on finding a small set of patterns that can be used to represent all of the ink in the document. Finding an optimal set of patterns is NP-hard; previous compression schemes have resorted to heuristics. This paper describes an extension of the cross-entropy approach, used previously for measuring pattern similarity, to this problem. This approach reduces the problem to a k-medians problem, for which the paper gives a new algorithm with a provably good performance guarantee. In comparison to previous heuristics (First Fit, with and without generalized Lloyd's/k-means post-processing steps), the new algorithm generates a better codebook, resulting in an overall improvement in compression performance of almost 17%.
- more...
- full text
- bibtex
- google scholar
- doi
- link
Journal of Algorithms 24(2):310-324(1997)
With Sandor Fekete, Samir Khuller, Monika Klemmstein and Balaji Raghavachari
The problem considered is the following. Given a graph with edge weights satisfying the triangle inequality, and a degree bound for each vertex, compute a low-weight spanning tree such that the degree of each vertex is at most its specified bound. The problem is NP-hard (it generalizes Traveling Salesman (TSP)). The paper describes a network-flow heuristic for modifying a given tree T to meet the constraints. Choosing T to be a minimum spanning tree (MST) yields approximation algorithms with performance guarantee less than 2 for the problem on geometric graphs with L_p-norms. The paper also describes a Euclidean graph whose minimum TSP costs twice the MST, disproving a conjecture made in [Low-degree spanning trees of small weight].
- more...
- full text
- bibtex
- google scholar
- doi
- link
Information Processing Letters 63(5):229-235(1997)
With S. L. Hakimi and Ed Schmeichel
The paper focuses on two problems: (i) how to orient the edges of an undirected graph in order to maximize the number of ordered vertex pairs $(x,y)$ such that there is a directed path from $x$ to $y$, and (ii) how to orient the edges so as to minimize the number of such pairs. The paper describes a quadratic-time algorithm for the first problem, and a proof that the second problem is NP-hard to approximate within some constant $1+\epsilon > 1$. The latter proof also shows that the second problem is equivalent to ``comparability graph completion''; neither problem was previously known to be NP-hard.
- more...
- full text
- bibtex
- google scholar
- doi
- link
Journal of Combinatorial Theory, Series A a76(1):44-54(1996)
With Jessica Millar and Neil Sloane
N.J.A. Sloane maintains an on-line encyclopedia of interesting integer sequences. This paper presents an operation for modifying such a sequence, and analyzes the operation using generating functions. If you are interested, search for the word ``boustrophedon'' at [1] or go here.
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM-SIAM Symposium on Discrete Algorithms
With Robert Lupton and Miller Maley
Conference version of [1998].
- more...
- full text
- bibtex
- google scholar
- link
SIAM Journal on Computing 25(2):355-368(1996); STOC'94
With Samir Khuller and Balaji Raghavachari
The degree-d spanning tree problem asks for a minimum-weight spanning tree in which the degree of each vertex is at most $d$. When $d=2$ the problem is TSP, and in this case, the well-known Christofides algorithm provides a 1.5-approximation algorithm (assuming the edge weights satisfy the triangle inequality).

Christos Papadimitriou and Umesh Vazirani posed the challenge of finding an algorithm with performance guarantee less than 2 for Euclidean graphs (points in $R^n$) and $d > 2$ [1984]. This paper gives the first answer to that challenge, presenting an algorithm to compute a degree-3 spanning tree of cost at most 5/3 times the MST. For points in the plane, the ratio improves to 3/2 and the algorithm can also find a degree-4 spanning tree of cost at most 5/4 times the MST.
Journal version of [1994].
- more...
- full text
- bibtex
- google scholar
- doi
- link
Discrete Applied Mathematics 69(3):281-289(1996)
With Samir Khuller and Balaji Raghavachari
The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NP-hard; This paper gives a proof that, for graphs where each directed cycle has at most three edges, the MEG problem is equivalent to maximum bipartite matching, and therefore solvable in polynomial time. This leads to an improvement in the performance guarantee of the previously best approximation algorithm for the general problem, from ``Approximating the minimum equivalent digraph'' [1995].

(This result was improved by Berman et al, WADS, 2009.)
- more...
- full text
- bibtex
- google scholar
- link
Fifth DIMACS Implementation Challenge
With Yossi Matias and S. Cenk Sahinalp
An experimental implementation and evaluation of data structures from ``Approximate data structures with applications'' [1994]. To summarize: ``Our results suggest that the data structure is practical and can be faster than traditional priority queues when holding a large number of keys, and that tolerance for approximate answers can lead to significant increases in speed.''
- more...
- full text
- bibtex
- google scholar
- link
SIAM Journal on Computing 25(6):1281-1304(1996); ICALP'94
With Mordecai Golin
Describes a near-linear-time algorithm for a variant of Huffman coding, in which the letters may have non-uniform lengths (as in Morse code), but with the restriction that each word to be encoded has equal probability.

See also [Huffman coding with unequal letter costs].
Journal version of [1994].
- more...
- full text
- bibtex
- google scholar
- doi
- link
SIAM Journal on Computing 24(4):859-872(1995); SODA'94
With Samir Khuller and Balaji Raghavachari
The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NP-hard. This paper gives an approximation algorithm with performance guarantee of pi²/6 ~ 1.64. The algorithm and its analysis are based on the simple idea of contracting long cycles.

(This result was strengthened slightly in ``On strongly connected digraphs with bounded cycle length'' [1996].) The analysis applies directly to 2-Exchange, a simple local improvement algorithm, showing that its performance guarantee is 1.75. It was further improved by Berman et al, WADS 2009.)
Journal version of [1994].
- more...
- full text
- bibtex
- google scholar
- doi
- link
Algorithmica 14(4):305-322(1995); SODA'93
With Samir Khuller and Balaji Raghavachari
This paper has a simple linear-time algorithm to find a spanning tree that simultaneously approximates a shortest-path tree and a minimum spanning tree. The algorithm provides a continuous trade-off: given the two trees and $\epsilon > 0$, the algorithm returns a spanning tree in which the distance between any vertex and the root of the shortest-path tree is at most $1+\epsilon$ times the shortest-path distance, and yet the total weight of the tree is at most $1+2/\epsilon$ times the weight of a minimum spanning tree. This is the best trade-off possible.
Journal version of [1993].
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM-SIAM Symposium on Discrete Algorithms
Randomized rounding is a standard method, based on the probabilistic method, for designing combinatorial approximation algorithms. This paper introduces a variant of randomized rounding that can be used to derive Lagrangian-relaxation algorithms and greedy algorithms. This was surprising - the approximation algorithms derived previously by the method require first solving a linear program (or some other optimization problem). Indeed, in Prabhakar Raghavan's seminal paper introducing randomized rounding [1988], he says: "The time taken to solve the linear program relaxations of the integer programs dominates the net running time theoretically (and, most likely, in practice as well)."

The variant of randomized rounding described here avoids this bottleneck, at least for packing/covering-type problems. The resulting algorithms are Lagrangian-relaxation algorithms and greedy algorithms (in fact the greedy set-cover algorithm is an example), and are faster and simpler to implement than standard randomized-rounding algorithms. This approach gives a systematic, coherent, and comprehensive understanding of Lagrangian-relaxation algorithms for packing and covering linear programs.

The ideas introduced here are also used in ``Sequential and parallel algorithms for mixed packing and covering'' [2001], in ``K-medians, facility location, and the Chernoff-Wald bound'' [2000]), and in ``On the number of iterations for Dantzig-Wolfe optimization and packing-covering approximation algorithms'' [1999].
Lecture notes on this topic are here.
- more...
- full text
- slides
- bibtex
- google scholar
- link
Technical Report, School of ORIE, Cornell University 1103(1994)
We consider the problem of choosing Euclidean points to maximize the sum of their weighted pairwise distances, when each point is constrained to a ball centered at the origin. We derive a dual minimization problem and show strong duality holds (i.e., the resulting upper bound is tight) when some locally optimal configuration of points is affinely independent. We sketch a polynomial time algorithm for finding a near-optimal set of points.
- more...
- full text
- bibtex
- google scholar
- link
Journal of Algorithms 17(2):280-289(1994); IPCO'93
With Samir Khuller and Uzi Vishkin
The paper describes a simple deterministic parallel/distributed $(2+\epsilon)$-approximation algorithm for the minimum-weight vertex-cover problem and its dual (edge/element packing). This paper was one of the first to use the primal-dual method for approximation in the distributed setting. This result was strengthened in `` Distributed and parallel algorithms for weighted vertex cover and other covering problems.''
Journal version of [1993].
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM-SIAM Symposium on Discrete Algorithms
With Yossi Matias and Jeff Vitter
This paper explores the notion of approximate data structures, which return approximately correct answers to queries, but run faster than their exact counterparts. The paper describes approximate variants of the van Emde Boas data structure, which support the same dynamic operations as the standard van Emde Boas data structure (min, max, successor, predecessor, and existence queries, as well as insertion and deletion), except that answers to queries are approximate. The variants support all operations in constant time provided the performance guarantee is 1+1/polylog n, and in O(log log n) time provided the performance guarantee is 1+1/poly(n), for n elements in the data structure.

Applications described include Prim's minimum-spanning-tree algorithm, Dijkstra's single-source shortest paths algorithm, and an on-line variant of Graham's convex hull algorithm. To obtain output which approximates the desired output with the performance guarantee tending to 1, Prim's algorithm requires only linear time, Dijkstra's algorithm requires O(m log log n) time, and the on-line variant of Graham's algorithm requires constant amortized time per operation.
- more...
- full text
- bibtex
- google scholar
- link
ACM-SIAM Symposium on Discrete Algorithms
With Samir Khuller and Balaji Raghavachari
Conference version of [1995].
- more...
- full text
- bibtex
- google scholar
- link
Information Processing Letters 50(1):49-55(1994); WADS'93
With Samir Khuller and Balaji Raghavachari
The traditional multi-commodity flow problem assumes a given flow network in which multiple commodities are to be maximally routed in response to given demands. This paper considers the multi-commodity flow network-design problem: given a set of multi-commodity flow demands, find a network subject to certain constraints such that the commodities can be maximally routed.

This paper focuses on the case when the network is required to be a tree. The main result is an approximation algorithm for the case when the tree is required to be of constant degree. The algorithm reduces the problem to the minimum-weight balanced-separator problem; the performance guarantee of the algorithm is within a factor of 4 of the performance guarantee of the balanced-separator procedure. If Tom Leighton and Satish Rao's balanced-separator procedure [1988] is used, the performance guarantee is $O(\log n)$. This improves the $O(\log^2 n)$ approximation factor that is trivial to obtain by a direct application of the balanced-separator method.
Journal version of [1993].
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM Symposium on Theory of Computing
With Samir Khuller and Balaji Raghavachari
Conference version of [1996].
- more...
- full text
- bibtex
- google scholar
- link
International Colloquium on Automata, Languages, and Programming
With Mordecai Golin
Conference version of [Golin96Prefix].
- more...
- full text
- bibtex
- google scholar
- doi
- link
ACM Symposium on Theory of Computing
With Richard Lipton
Von Neumann's Min-Max Theorem guarantees that each player of a zero-sum matrix game has an optimal mixed strategy. This paper gives an elementary proof that each player has a near-optimal mixed strategy that chooses uniformly at random from a multi-set of pure strategies of size logarithmic in the number of pure strategies available to the opponent.

For exponentially large games, for which even representing an optimal mixed strategy can require exponential space, it follows that there are near-optimal, linear-size strategies. These strategies are easy to play and serve as small witnesses to the approximate value of the game.

As a corollary, it follows that every language has small ``hard'' multi-sets of inputs certifying that small circuits can't decide the language. For example, if SAT does not have polynomial-size circuits, then, for each $n$ and $c$, there is a set of $n^{O(c)}$ Boolean formulae of size $n$ such that no circuit of size $n^c$ (or algorithm running in time $n^c$) classifies more than two-thirds of the formulae successfully.

The "simple strategies" observation was generalized to Nash equilibrium by Lipton et al. in their paper Playing large games using simple strategies.
- more...
- full text
- slides
- bibtex
- google scholar
- doi
- link
Algorithmica 11(6):525-541(1994); SODA'91
This paper has two results. The first is based on the surprising observation that the well-known ``least-recently-used'' paging algorithm and the ``balance'' algorithm for weighted caching are linear-programming primal-dual algorithms. This observation leads to a strategy (called ``Greedy-Dual'') that generalizes them both and has an optimal performance guarantee for weighted caching.

For the second set of results, the paper presents empirical studies of paging algorithms, documenting that in practice, on ``typical'' cache sizes and sequences, the performance of paging strategies are much better than their worst-case analyses in the standard model suggest. The paper then presents theoretical results that support and explain this. For example: on any input sequence, with almost all cache sizes, either the performance guarantee of least-recently-used is $O(\log k)$ or the fault rate (in an absolute sense) is insignificant.

These results are strengthened and/or generalized in [1998] and [ 2009].
Journal version of [1991].
- more...
- full text
- bibtex
- google scholar
- doi
- link
Integer Programming and Combinatorial Optimization
With Samir Khuller and Uzi Vishkin
Conference version of [1994].
- more...
- full text
- bibtex
- google scholar
- link
ACM-SIAM Symposium on Discrete Algorithms
With Samir Khuller and Balaji Raghavachari
Conference version of [1995].
- more...
- full text
- bibtex
- google scholar
- link
Workshop on Algorithms and Data Structures
With Samir Khuller and Balaji Raghavachari
Conference version of [1994].
- more...
- full text
- bibtex
- google scholar
- doi
- link
Journal of Algorithms 12(4):685-699(1991)
With Amos Fiat, Richard Karp, Michael Luby, Lyle McGeoch and Daniel Sleator
The paging problem is that of deciding which pages to keep in a memory of $k$ pages in order to minimize the number of page faults. This paper introduced the marking algorithm , a randomized on-line algorithm for the paging problem, and a proof that its performance guarantee is $O(\log k)$. This was one of the first results in the area showing that a randomized algorithm can have a strictly better performance guarantee than any deterministic on-line algorithm (no deterministic on-line algorithm can have a performance guarantee better than k).
- more...
- full text
- bibtex
- google scholar
- doi
- link
Technical Report, Computer Science Department, Princeton University TR-348-91(1991)
The thesis was about on-line algorithms. The main results were published in papers described above. The unpublished results were mainly exploring the role of linear-programming duality in on-line algorithms, including a method for deriving a potential-function analysis from a linear-programming duality analysis, and studies of some on-line matching algorithms. There were also some minor results analyzing paging strategies with lookahead.
- more...
- full text
- bibtex
- google scholar
- link
Networks 21(2):205-221(1991)
With Robert Tarjan and James Orlin
The parametric shortest path problem is to find the shortest paths in graph where the edge costs are of the form $w_{ij}+\lambda$, where each $w_{ij}$ is constant and $\lambda$ is a parameter that varies. The problem is to find shortest path trees for every possible value of $\lambda$.

The minimum-balance problem is to find a ``weighting'' of the vertices so that adjusting the edge costs by the vertex weights yields a graph in which, for every cut, the minimum weight of any edge crossing the cut in one direction equals the minimum weight of any edge crossing the cut in the other direction.

The paper presents the fastest known algorithms for both problems. The algorithms run in $O(nm+n^2\log n)$ time. The paper also describes empirical studies of the algorithms on random graphs, suggesting that the expected time for finding minimum-mean cycles (an important special case of both problems) is $O(m + n \log n)$.
- more...
- full text
- bibtex
- google scholar
- doi
- link
Technical Report, Computer Science Department, Princeton University TR-317-91(1991)
With Laszlo Lovasz
This technical report presented notes from the first eight lectures of the class Many Models of Complexity taught by László Lovász at Princeton University in the fall of 1990. The topic is evasiveness of graph properties: given a graph property, how many edges of the graph an algorithm must check in the worst case before it knows whether the property holds.
- more...
- full text
- bibtex
- google scholar
- link
ACM-SIAM Symposium on Discrete Algorithms
Conference version of The K-Server Dual and Loose Competitiveness for Paging.
- more...
- full text
- bibtex
- google scholar
- link

neal young / publications

Main Navigation

publicationsCurrent page is: publications

publicationsCurrent page is: publications

Breadcrumbs

Search

neal young / publications

Main Navigation

publicationsCurrent page is: publications

publicationsCurrent page is: publications