GRAPH PROCESSING
ASPLOS'17 KickStarter: Fast and Accurate Computations on Streaming Graphs via Trimmed Approximations
K. Vora, R. Gupta, and G. Xu
ACM 22nd International Conference on Architectural Support for Programming Languages and Operating Systems,
pages 237-251, Xi'an, China, April 2017.
Continuous processing of a streaming graph iteratively maintains an approximate result of the computation on a recent version of the graph. Upon a user query, the accurate result on the current graph can be quickly computed by feeding the approximate results to the iterative computation — a form of incremental computation that corrects the (small amount of) error in the approximate result. Despite the effectiveness of this approach in processing growing graphs, it is not generally applicable when edge deletions are present — existing approximations can lead to either incorrect results (e.g., for monotonic algorithms the computation terminates at an incorrect minimum/maximum) or poor performance (e.g., with approximations, convergence takes longer than performing the computation from scratch).

In this paper we present KickStarter, which, for a general class of monotonic graph algorithms, trims the approximations down to a subset of vertex values whose use preserves the correctness of results while still allowing the majority of existing approximations to be directly reused for efficiency. Our experiments with four streaming algorithms on five real-world graphs demonstrate that trimming not only produces correct results but also accelerates these algorithms by 8.5-23.7x.
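
A minimal sketch of the trimming idea for a min-based monotonic computation such as shortest paths; the structures and the parent-based dependence tracking shown here are illustrative assumptions, not KickStarter's actual implementation:

    #include <limits>
    #include <queue>
    #include <vector>

    constexpr double INF = std::numeric_limits<double>::infinity();

    // Per-vertex state: the approximate value and the in-neighbor whose
    // value determined it (the vertex's dependence parent).
    struct State {
        std::vector<double> value;
        std::vector<int>    parent;
    };

    // On deletion of edge (u,v): if v's value was derived from u, then v
    // and everything transitively derived from v is unsafe and gets reset;
    // all other approximations are kept and reused when the iterative
    // computation resumes, which preserves correctness cheaply.
    void trim(const std::vector<std::vector<int>>& out_neighbors,
              State& s, int u, int v) {
        if (s.parent[v] != u) return;   // v never depended on the deleted edge
        std::queue<int> work;
        work.push(v);
        while (!work.empty()) {
            int x = work.front(); work.pop();
            s.value[x]  = INF;          // trimmed: will be recomputed
            s.parent[x] = -1;
            for (int y : out_neighbors[x])
                if (s.parent[y] == x)   // y's value was derived from x
                    work.push(y);
        }
    }
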
ASPLOS'17 CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing
K. Vora, C. Tian, R. Gupta, and Z. Hu
ACM 22nd International Conference on Architectural Support for Programming Languages and Operating Systems,
pages 223-236, Xi'an, China, April 2017.
Existing distributed asynchronous graph processing systems employ checkpointing to capture globally consistent snapshots and roll back all machines to the most recent checkpoint to recover from machine failures. In this paper we argue that, due to the relaxed asynchronous execution semantics, recovery in distributed asynchronous graph processing does not require the entire execution state to be rolled back to a globally consistent state. We define the properties required of the recovered state for it to be usable for correct asynchronous processing and develop CoRAL, a lightweight checkpointing and recovery algorithm. First, this algorithm carries out confined recovery that rolls back only the graph execution states of the failed machines to effect recovery. Second, it relies upon lightweight checkpoints that capture locally consistent snapshots with a reduced peak network bandwidth requirement. Our experiments using real-world graphs show that our technique recovers from failures and finishes processing 1.5-3.2x faster than the traditional asynchronous checkpointing and recovery mechanism when failures impact 6-37% of the machines in the cluster. Moreover, capturing locally consistent snapshots significantly reduces the intermittent high bandwidth usage required to save the snapshots -- the average reduction in 99th percentile bandwidth is 22-51% when maintaining 1-6 snapshot replicas.
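
A minimal sketch of confined recovery over locally consistent snapshots; the structures and helpers are hypothetical, and CoRAL's actual protocol additionally enforces the usability properties the paper defines on the recovered state:

    #include <unordered_set>
    #include <vector>

    struct Partition { std::vector<double> values; };

    struct Machine {
        int id = 0;
        Partition live;        // current asynchronous execution state
        Partition snapshot;    // last locally consistent snapshot

        // Locally consistent checkpoint: taken independently per machine,
        // so machines need not pause for global coordination and snapshot
        // traffic can be staggered to cut peak bandwidth.
        void checkpoint() { snapshot = live; }
    };

    // Confined recovery: only the failed machines roll back to their
    // snapshots; survivors keep their more recent state, which the
    // relaxed asynchronous semantics can tolerate.
    void confined_recovery(std::vector<Machine>& cluster,
                           const std::unordered_set<int>& failed) {
        for (Machine& m : cluster)
            if (failed.count(m.id))
                m.live = m.snapshot;   // restore only the failed state
    }
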
USENIX ATC'16 Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing
K. Vora, G. Xu, and R. Gupta
USENIX Annual Technical Conference,
pages 507-522, Denver, Colorado, June 2016.
Single-PC, disk-based processing of big graphs has recently gained much popularity. At the core of an efficient disk-based system is a well-designed partition structure that can minimize random disk accesses. All existing systems use static partitions that are created before processing starts. These partitions have static layouts and are loaded entirely into memory in every iteration even though much of the edge data does not change across many iterations, so the unchanged edges have zero new impact on the computation of vertex values.

This work provides a general optimization that removes this I/O inefficiency by employing dynamic partitions whose layouts are dynamically adjustable. We have implemented this optimization in GraphChi -- a representative out-of-core vertex-centric graph system -- speeding it up by 1.5-2.8x on six large graphs. Our idea is generally applicable to other systems as well.
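
A minimal sketch of a dynamically adjustable partition, assuming a computation in which a vertex whose value has stopped changing produces no further updates; the structures are illustrative, not GraphChi's actual partition format:

    #include <algorithm>
    #include <vector>

    struct Edge { int src, dst; double val; };

    struct DynamicPartition {
        std::vector<Edge> edges;   // what the next iteration will load

        // Drop edges whose source did not change in the last iteration:
        // they cannot produce new updates, so future loads shrink and
        // I/O on unchanged edge data disappears.
        void shrink(const std::vector<bool>& changed) {
            edges.erase(std::remove_if(edges.begin(), edges.end(),
                            [&](const Edge& e) { return !changed[e.src]; }),
                        edges.end());
        }
    };
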
TACO'16 Synergistic Analysis of Evolving Graphs
K. Vora, R. Gupta, and G. Xu
ACM Transactions on Architecture and Code Optimization,
Volume 13, Issue 4, Article No. 32, 27 pages, December 2016.
Evolving graph processing involves repeating analyses, which are often iterative, over multiple snapshots of the graph corresponding to different points in time. Since the snapshots of an evolving graph share a great number of vertices and edges, traditional approaches that process these snapshots one at a time without exploiting this overlap waste much effort on both data loading and computation, making them extremely inefficient. In this paper, we identify the major sources of inefficiency and present two optimization techniques to address them. First, we propose a technique for amortizing the fetch cost by merging the fetching of values for different snapshots of the same vertex. Second, we propose a technique for amortizing the processing cost by feeding values computed for earlier snapshots into later snapshots. We have implemented these optimizations in two distributed graph processing systems, namely GraphLab and ASPIRE. Our experiments with multiple real evolving graphs and algorithms show that, on average, fetch amortization speeds up the execution of GraphLab and ASPIRE by 5.2× and 4.1×, respectively. Amortizing the processing cost yields additional average speedups of 2× and 7.9×, respectively.
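
A minimal sketch of the two amortizations, with stand-in stubs (hypothetical names) for the remote fetch and the per-snapshot iterative solve:

    #include <vector>

    // Stand-ins for the system's remote value fetch and for running the
    // iterative algorithm on one snapshot to convergence.
    double fetch_remote(int /*snapshot*/, int /*vertex*/) { return 0.0; }
    double solve(int /*snapshot*/, double seed) { return seed; }

    // Fetch amortization: retrieve a vertex's value for all k snapshots
    // together instead of issuing k separate fetches.
    std::vector<double> fetch_all(int v, int k) {
        std::vector<double> vals(k);
        for (int s = 0; s < k; ++s)
            vals[s] = fetch_remote(s, v);   // merged into one message in practice
        return vals;
    }

    // Processing amortization: the value converged on snapshot s seeds
    // snapshot s+1, which shares most of its vertices and edges with s.
    std::vector<double> process_all(int k, double init) {
        std::vector<double> results(k);
        double seed = init;
        for (int s = 0; s < k; ++s) {
            results[s] = solve(s, seed);
            seed = results[s];              // reuse the earlier result
        }
        return results;
    }
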
HPDC'16 Efficient Processing of Large Graphs via Input Reduction
A. Kusum, K. Vora, R. Gupta, and I. Neamtiu
25th ACM International Symposium on High-Performance Parallel and Distributed Computing,
pages 245-257, Kyoto, Japan, May-June 2016.
Large-scale parallel graph analytics involves executing iterative algorithms (e.g., PageRank, Shortest Paths, etc.) that are both data- and compute-intensive. In this work we construct faster versions of iterative graph algorithms from their original counterparts using input graph reduction. A large input graph is transformed into a small graph using a sequence of input reduction transformations. Savings in execution time are achieved using our two-phase processing model, which runs the original iterative algorithm in two phases: first on the reduced input graph, to gain savings in execution time; and second on the original input graph, using the results from the first phase to compute precise results. We propose several input reduction transformations and identify the structural and non-structural properties that they guarantee, which in turn are used to ensure the correctness of results under our two-phase processing model. We further present a unified input reduction algorithm that efficiently applies a non-interfering sequence of simple local input reduction transformations. Our experiments show that our transformation techniques enable significant reductions in execution time (1.25x-2.14x) while achieving precise final results for most of the algorithms. For cases where precise results cannot be achieved, the relative error remains very small (at most 0.065).
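
A minimal sketch of the two-phase processing model, assuming a mapping from original to reduced vertex ids and a stand-in run() for the unmodified iterative algorithm:

    #include <vector>

    struct Graph { int n = 0; /* adjacency omitted for brevity */ };

    // Stand-in: run the iterative algorithm to convergence from the given
    // initial vertex values.
    std::vector<double> run(const Graph&, std::vector<double> init) { return init; }

    // Phase 1 solves the small reduced graph cheaply; phase 2 seeds the
    // original graph with those results, so the precise computation starts
    // near its fixed point and converges in few iterations.
    std::vector<double> two_phase(const Graph& original, const Graph& reduced,
                                  const std::vector<int>& to_reduced) {
        std::vector<double> phase1 =
            run(reduced, std::vector<double>(reduced.n, 0.0));
        std::vector<double> seed(original.n, 0.0);
        for (int v = 0; v < original.n; ++v)
            if (to_reduced[v] != -1)          // vertex survived the reduction
                seed[v] = phase1[to_reduced[v]];
        return run(original, seed);           // phase 2: precise results
    }
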
HPDC'14 CuSha: Vertex-Centric Graph Processing on GPUs
F. Khorasani, K. Vora, R. Gupta, and L.N. Bhuyan
23rd ACM International Symposium on High Performance Parallel and Distributed Computing,
pages 239-251, Vancouver, Canada, June 2014.
Vertex-centric graph processing is employed by many popular algorithms (e.g., PageRank) due to its simplicity and efficient use of asynchronous parallelism. The high compute power provided by SIMT architectures presents an opportunity for accelerating these algorithms using GPUs. Prior work on GPU graph processing employs the Compressed Sparse Row (CSR) format for its space efficiency; however, CSR suffers from irregular memory accesses and GPU underutilization that limit its performance. In this paper, we present CuSha, a CUDA-based graph processing framework that overcomes these obstacles through two novel graph representations: G-Shards and Concatenated Windows (CW). G-Shards uses a concept recently introduced for non-GPU systems that organizes a graph into autonomous sets of ordered edges called shards. CuSha's mapping of GPU hardware resources onto shards allows fully coalesced memory accesses. CW is a novel representation that enhances the use of shards to achieve higher GPU utilization for processing sparse graphs. Finally, CuSha fully utilizes the GPU by processing multiple shards in parallel on the GPU's streaming multiprocessors. For ease of programming, CuSha allows the user to define the vertex-centric computation and plug it into its framework for parallel processing of large graphs. Our experiments show that CuSha provides significant speedups over the state-of-the-art CSR-based virtual warp-centric method for processing graphs on GPUs.
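
A minimal sketch of a shard laid out as a structure of arrays, the property that lets consecutive GPU threads make fully coalesced accesses; the field names are illustrative and CuSha's actual G-Shard format carries more detail:

    #include <vector>

    // One shard holds the edges whose destination vertex falls in this
    // shard's interval, stored column-wise (structure of arrays).
    struct GShard {
        std::vector<int>   src_index;   // where to read each source value
        std::vector<int>   dst_index;   // where to write each update
        std::vector<float> src_value;   // staged copies of source values
        std::vector<float> edge_value;  // per-edge data (e.g., weight)
    };
    // Thread i processes entry i, so across a warp the loads of
    // src_value[i] and edge_value[i] touch consecutive addresses and
    // coalesce into a single memory transaction.
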
OOPSLA'14 ASPIRE: Exploiting Asynchronous Parallelism in Iterative Algorithms using a Relaxed Consistency based DSM
K. Vora, S-C. Koduru, and R. Gupta
ACM SIGPLAN International Conference on Object Oriented Programming Systems, Languages and Applications,
pages 861-878, Portland, Oregon, October 2014.
Many vertex-centric graph algorithms can be expressed using asynchronous parallelism by relaxing certain read-after-write data dependences and allowing threads to compute vertex values using stale (i.e., not the most recent) values of their neighboring vertices. We observe that on distributed shared memory systems, converting synchronous algorithms into their asynchronous counterparts makes them tolerant of high inter-node communication latency. However, high inter-node communication latency can lead to excessive use of stale values, causing an increase in the number of iterations required by the algorithms to converge. Although bounded staleness can restrict the slowdown in the rate of convergence, it also restricts the ability to tolerate communication latency. In this paper we design a relaxed memory consistency model and consistency protocol that simultaneously tolerate communication latency and minimize the use of stale values. This is achieved via a coordinated use of a best-effort refresh policy and bounded staleness. We demonstrate that, for a range of asynchronous graph algorithms and PDE solvers, our approach on average outperforms algorithms based upon prior relaxed memory models that allow stale values by at least 2.27x, and the Bulk Synchronous Parallel (BSP) model by 4.2x. We also show that our approach frequently outperforms GraphLab, a popular distributed graph processing framework.
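
A minimal sketch of a bounded-staleness read, with a stand-in remote_fetch(); in ASPIRE a best-effort refresher additionally updates cached copies in the background so that most reads stay on the fast path:

    #include <cstdint>

    struct CacheEntry { double value = 0.0; uint64_t version = 0; };

    // Stand-in for a blocking fetch that refreshes the cached copy.
    double remote_fetch(int /*vertex*/, CacheEntry& e, uint64_t round) {
        e.version = round;               // copy is now current
        return e.value;
    }

    // A stale cached value is acceptable while it is at most 'bound'
    // rounds old (tolerating communication latency); beyond that the
    // reader waits for a refresh so convergence does not degrade.
    double read(int v, CacheEntry& e, uint64_t round, uint64_t bound) {
        if (round - e.version <= bound)
            return e.value;                // fast path: use the stale copy
        return remote_fetch(v, e, round);  // too stale: synchronous refresh
    }
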

OTHER AREAS
IJPP'17 Software Speculation on Caching DSMs
S-C. Koduru, K. Vora, and R. Gupta
International Journal of Parallel Programming,
20 pages, April 2017.
Clusters with caching DSMs deliver programmability and performance by supporting a shared-memory programming model and tolerating the communication latency of remote fetches via caching. The input of a data-parallel program is partitioned across the machines in the cluster while the DSM transparently fetches and caches remote data as needed by the application. Irregular applications are challenging to parallelize because the input-related data dependences that manifest at runtime require the use of speculation to efficiently exploit parallelism. By speculating that there are no cross-iteration dependences, multiple iterations of a data-parallel loop are executed in parallel using locally cached copies of data; the absence of dependences is validated before committing the speculatively computed results. In this paper we show that in irregular data-parallel applications, while caching helps tolerate long communication latencies, using a value read from the cache in a computation can lead to misspeculation, and thus aggressive caching can degrade performance due to an increased misspeculation rate. To limit the misspeculation rate we present optimizations for distributed speculation on caching DSMs that decrease the cost of misspeculation checks and speed up the re-execution of misspeculated computations. These optimizations give speedups of 2.24x for graph coloring, 1.71x for connected components, 1.88x for community detection, 1.32x for shortest path, and 1.74x for pagerank over baseline parallel executions.
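
A minimal sketch of speculative reads over cached data with version-based validation; the structures are illustrative, not the system's actual metadata:

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Versioned { double value = 0.0; uint64_t version = 0; };

    // Log of (vertex, version) pairs read speculatively in one iteration.
    struct ReadLog { std::vector<std::pair<int, uint64_t>> reads; };

    // Speculative read: use the locally cached (possibly stale) copy and
    // remember which version was read.
    double cached_read(const std::vector<Versioned>& cache, ReadLog& log, int v) {
        log.reads.push_back({v, cache[v].version});
        return cache[v].value;
    }

    // Misspeculation check: a cheap version comparison against committed
    // state; only iterations whose reads were overwritten re-execute.
    bool validate(const std::vector<Versioned>& committed, const ReadLog& log) {
        for (const auto& [v, ver] : log.reads)
            if (committed[v].version != ver) return false;  // violated
        return true;
    }
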
OOPSLA'15 RAIVE: Runtime Assessment of Floating-Point Instability by Vectorization
W-C. Lee, T. Bao, Y. Zheng, X. Zhang, K. Vora, and R. Gupta
ACM SIGPLAN International Conference on Object Oriented Programming Systems, Languages and Applications,
pages 623-638, Pittsburgh, Pennsylvania, October 2015.
Floating-point representation has limited precision, and inputs to floating-point programs may also have errors. Consequently, during execution, errors are introduced, propagated, and accumulated, leading to unreliable outputs. We call this the instability problem. We propose RAIVE, a technique that identifies output variations of a floating-point execution in the presence of instability. RAIVE transforms every floating-point value into a vector of multiple values -- the values added to create the vector are obtained by introducing artificial errors that are upper bounds of the actual errors. The propagation of artificial errors models the propagation of actual errors. When values in vectors result in discrete execution differences (e.g., following different paths), the execution is forked to capture the resulting output variations. Our evaluation shows that RAIVE can precisely capture output variations. Its overhead (340%) is 2.43 times lower than that of the state of the art.
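
A minimal sketch of the vectorization idea, assuming four lanes and a one-ulp artificial error (both illustrative choices, not RAIVE's actual parameters):

    #include <array>
    #include <cfloat>
    #include <cmath>

    // Each scalar becomes a small vector: lane 0 holds the original value,
    // the other lanes hold copies perturbed by an artificial error that
    // upper-bounds the actual rounding error.
    struct VecVal {
        std::array<double, 4> lanes;

        static VecVal make(double x) {
            double err = std::nextafter(x, DBL_MAX) - x;  // ~ one ulp at x
            VecVal v;
            v.lanes = {x, x + err, x - err, x + 2 * err};
            return v;
        }
    };

    // Operations apply lane-wise, so the artificial errors propagate the
    // same way actual errors would.
    VecVal operator*(const VecVal& a, const VecVal& b) {
        VecVal r;
        for (int i = 0; i < 4; ++i) r.lanes[i] = a.lanes[i] * b.lanes[i];
        return r;
    }

    // Lanes disagreeing on a branch signal a discrete execution
    // difference; at such points RAIVE forks execution.
    bool lanes_diverge(const VecVal& v, double threshold) {
        bool first = v.lanes[0] < threshold;
        for (int i = 1; i < 4; ++i)
            if ((v.lanes[i] < threshold) != first) return true;
        return false;
    }
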
CLUSTER'15 Optimizing Caching DSM for Distributed Software Speculation
S-C. Koduru, K. Vora, and R. Gupta
IEEE International Conference on Cluster Computing,
pages 452-455, Chicago, Illinois, September 2015.
Clusters with caching DSMs deliver programmability and performance by supporting shared-memory programming and tolerating remote I/O latencies via caching. The input to a data-parallel program is partitioned across the cluster while the DSM transparently fetches and caches remote data as needed. Irregular applications, however, are challenging to parallelize because the input-related data dependences that manifest at runtime require the use of speculation for correct parallel execution. By speculating that there are no input-related cross-iteration dependences, private copies of the input can be processed by parallelizing the loop; the absence of dependences is validated before committing the computed results. We show that while caching helps tolerate long communication latencies in irregular data-parallel applications, using a cached value in a computation can lead to misspeculation, and thus aggressive caching can degrade performance due to an increased misspeculation rate. We present optimizations for distributed speculation on caching DSMs that decrease the cost of misspeculation checks and speed up the re-execution of misspeculated computations. Optimized distributed speculation achieves speedups of 2.24x for coloring, 1.71x for connected components, 1.88x for community detection, 1.32x for shortest path, and 1.74x for pagerank over unoptimized speculation.
HIPS'14 ABC2: Adaptively Balancing Computation & Communication in a DSM cluster of Multicores for Irregular Applications
S-C. Koduru, K. Vora, and R. Gupta
Workshop on High-Level Parallel Programming Models and Supportive Environments,
pages 391-400, in IEEE IPDPSW Proceedings, Phoenix, May 2014.
Graph-based applications have become increasingly important in many application domains. The large graph sizes offer data-level parallelism at a scale that makes it attractive to run such applications on distributed shared memory (DSM) based modern clusters composed of multicore machines. Our analysis of several graph applications that rely on speculative or asynchronous parallelism shows that the balance between computation and communication differs across applications. In this paper, we study this balance in the context of DSMs and exploit the multiple cores present in modern multicore machines by creating three kinds of threads that allow us to dynamically balance computation and communication: compute threads that exploit data-level parallelism in the computation; fetch threads that replicate data into object-stores before it is accessed by compute threads; and update threads that make results computed by compute threads visible to all compute threads by writing them to the DSM. We observe that the best configuration for the above mechanisms varies across different inputs in addition to varying across different applications. To this end, we design ABC2: a runtime algorithm that automatically configures the DSM using simple runtime information such as the observed object prefetch and update queue lengths. This runtime algorithm achieves speedups close to those of the best hand-optimized configurations.
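
A minimal sketch of the adaptive policy, deciding which role a spare core should serve from observed queue lengths; the thresholds and names are illustrative, not ABC2's actual algorithm:

    #include <cstddef>

    enum class Role { Compute, Fetch, Update };

    // If prefetch queues run dry, compute threads will starve: add a
    // fetch thread. If update queues back up, results are not draining:
    // add an update thread. Otherwise spend the core on computation.
    Role next_thread_role(std::size_t prefetch_q_len, std::size_t update_q_len) {
        constexpr std::size_t LOW = 16, HIGH = 1024;  // illustrative cutoffs
        if (prefetch_q_len < LOW) return Role::Fetch;
        if (update_q_len > HIGH)  return Role::Update;
        return Role::Compute;
    }
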