abadi-sigmod08 inproceedings SIGMOD Conference Paper 2008 Column-Stores vs. Row-Stores: How Different Are They Really? Vancouver, Canada SIGMOD 2008 oltp-perf inproceedings SIGMOD Conference Paper 2008 OLTP Through the Looking Glass, And What We Found There Vancouver, Canada SIGMOD 2008 abadi-rdf inproceedings 246 VLDB Conference Paper 2007 Best Paper Award Scalable Semantic Web Data Management Using Vertical Partitioning Efficient management of RDF data is an important factor in realizing the Semantic Web vision. Performance and scalability issues are becoming increasingly pressing as Semantic Web technology is applied to real-world applications. In this paper, we examine the reasons why current data management solutions for RDF data scale poorly, and explore the fundamental scalability limitations of these approaches. We review the state of the art for improving performance for RDF databases and consider a recent suggestion, 'property tables'. We then discuss practically and empirically why this solution has undesirable features. As an improvement, we propose an alternative solution: vertically partitioning the RDF data. We compare the performance of vertical partitioning with prior art on queries generated by a Web-based RDF browser over a large-scale (more than 50 million triples) catalog of library data. Our results show that a vertical partitioned schema achieves similar performance to the property table technique while being much simpler to design. Further, if a column-oriented DBMS (a database architected specially for the vertically partitioned case) is used instead of a row-oriented DBMS, another order of magnitude performance improvement is observed, with query times dropping from minutes to several seconds. http://cs-www.cs.yale.edu/homes/dna/papers/abadirdf.pdf Vienna, Austria VLDB 2007 hstore inproceedings 444 VLDB Conference Paper 2007 The End of an Architectural Era (It's Time for a Complete Rewrite) In previous papers, some of us predicted the end of 'one size fits all' as a commercial relational DBMS paradigm. These papers presented reasons and experimental evidence that showed that the major relational RDBMS vendors can be outperformed by 1-2 orders of magnitude by specialized engines in the data warehouse, stream processing, text, and scientific database markets. Assuming that specialized engines dominate these markets over time, the current relational DBMS code lines will be left with the business data processing (OLTP) market and hybrid markets where more than one capability is required. In this paper we show that current RDBMSs can be beaten by nearly two orders of magnitude in the OLTP market as well. The experimental evidence comes from comparing a new OLTP prototype, H-Store, which we have built at M.I.T., to a popular RDBMS on the standard transactional benchmark, TPC-C. We conclude that current RDBMS code lines, while attempting to be a 'one size fits all' solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of 'from scratch' specialized engines. The DBMS vendors (and research community) should start with a clean sheet of paper and design systems for tomorrow's requirements, not continue to push code lines and architectures designed for yesterday's needs. http://cs-www.cs.yale.edu/homes/dna/papers/vldb07hstore.pdf Vienna, Austria VLDB 2007 abadi-cidr inproceedings 156 CIDR Conference Paper 2007 Column Stores for Wide and Sparse Data While it is generally accepted that data warehouses and OLAP workloads are excellent applications for column-stores, this paper speculates that column-stores may well be suited for additional applications. In particular we observe that column-stores do not see a performance degradation when storing extremely wide tables, and column-stores handle sparse data very well. These two properties lead us to conjecture that column-stores may be good storage layers for Semantic Web data, XML data, and data with GEM-style schemas. http://cs-www.cs.yale.edu/homes/dna/papers/abadicidr07.pdf Asilomar, CA, USA CIDR 2007 cstore-mat inproceedings 327 ICDE Conference Paper 2007 466--475 Materialization Strategies in a Column-Oriented DBMS There has been renewed interest in column-oriented database architectures in recent years. For read-mostly query workloads such as those found in data warehouse and decision support applications, column-stores have been shown to perform particularly well relative to row-stores. In order for column-stores to be readily adopted as a replacement for row-stores, however, they must present the same interface to client applications as do row stores, which implies that they must output row-store-style tuples. Thus, the input columns stored on disk must be converted to rows at some point in the query plan, but the optimal point at which to do the conversion is not obvious. This problem can be considered as the opposite of the projection problem in row-store systems: while row-stores need to determine where in query plans to place projection operators to make tuples narrower, column-stores need to determine when to combine single-column projections into wider tuples. This paper describes a variety of strategies for tuple construction and intermediate result representations and provides a systematic evaluation of these strategies. http://cs-www.cs.yale.edu/homes/dna/papers/abadiicde2007.pdf Istanbul, Turkey ICDE 2007 cstore-perf inproceedings 354 VLDB Conference Paper 2006 487--498 Performance Tradeoffs in Read-Optimized Databases Database systems have traditionally optimized performance for write-intensive workloads. Recently, there has been renewed interest in architectures that optimize read performance by using column-oriented data representation and light-weight compression. This previous work has shown that under certain broad classes of workloads, column-based systems can outperform row-based systems. Previous work, however, has not characterized the precise conditions under which a particular query workload can be expected to perform better on a column-oriented database. In this paper we first identify the distinctive components of a read-optimized DBMS and describe our implementation of a high-performance query engine that can operate on both row and column-oriented data. We then use our prototype to perform an in-depth analysis of the tradeoffs between column and row-oriented architectures. We explore these tradeoffs in terms of disk bandwidth, CPU cache latency, and CPU cycles. We show that for most database workloads, a carefully designed column system can outperform a carefully designed row system, sometimes by an order of magnitude. We also present an analytical model to predict whether a given workload on a particular hardware configuration is likely to perform better on a row or column-based system. http://cs-www.cs.yale.edu/homes/dna/papers/VLDB06.pdf Seoul, Korea VLDB 2006 cstore-comp inproceedings 265 SIGMOD Conference Paper 2006 671--682 Integrating Compression and Execution in Column-Oriented Database Systems Column-oriented database system architectures invite a re-evaluation of how and when data in databases is compressed. Storing data in a column-oriented fashion greatly increases the similarity of adjacent records on disk and thus opportunities for compression. The ability to compress many adjacent tuples at once lowers the per-tuple cost of compression, both in terms of CPU and space overheads. In this paper, we discuss how we extended C-Store (a column-oriented DBMS) with a compression sub-system. We show how compression schemes not traditionally used in row-oriented DBMSs can be applied to column-oriented systems. We then evaluate a set of compression schemes and show that the best scheme depends not only on the properties of the data but also on the nature of the query workload. http://cs-www.cs.yale.edu/homes/dna/papers/abadisigmod06.pdf Chicago, IL, USA SIGMOD 2006 reed inproceedings 292 VLDB Conference Paper 2005 769--780 REED: Robust, Efficient Filtering and Event Detection in Sensor Networks This paper presents a set of algorithms for efficiently evaluating join queries over static data tables in sensor networks. We describe and evaluate three algorithms that take advantage of distributed join techniques. Our algorithms are capable of running in limited amounts of RAM, can distribute the storage burden over groups of nodes, and are tolerant to dropped packets and node failures. REED is thus suitable for a wide range of event-detection applications that traditional sensor network database and data collection systems cannot be used to implement. http://cs-www.cs.yale.edu/homes/dna/papers/abadivldb05.pdf Trondheim, Norway VLDB 2005 cstore inproceedings 170 VLDB Conference Paper 2005 553--564 C-Store: A Column-Oriented DBMS This paper presents the design of a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized. Among the many differences in its design are: storage of data by column rather than by row, careful coding and packing of objects into storage including main memory during query processing, storing an overlapping collection of column-oriented projections, rather than the current fare of tables and indexes, a non-traditional implementation of transactions which includes high availability and snapshot isolation for read-only transactions, and the extensive use of bitmap indexes to complement B-tree structures. We present preliminary performance data on a subset of TPC-H and show that the system we are building, C-Store, is substantially faster than popular commercial products. Hence, the architecture looks very encouraging. http://cs-www.cs.yale.edu/homes/dna/papers/vldb.pdf Trondheim, Norway VLDB 2005 borealis inproceedings 143 CIDR Conference Paper 2005 The Design of the Borealis Stream Processing Engine Borealis is a second-generation distributed stream processing engine that is being developed at Brandeis University, Brown University, and MIT. Borealis inherits core stream processing functionality from Aurora and distribution functionality from Medusa. Borealis modifies and extends both systems in non-trivial and critical ways to provide advanced capabilities that are commonly required by newly-emerging stream processing applications. In this paper, we outline the basic design and functionality of Borealis. Through sample real-world applications, we motivate the need for dynamically revising query results and modifying query specifications. We then describe how Borealis addresses these challenges through an innovative set of features, including revision records, time travel, and control lines. Finally, we present a highly flexible and scalable QoS-based optimization model that operates across server and sensor networks and a new fault-tolerance model with flexible consistency-availability trade-offs. http://cs-www.cs.yale.edu/homes/dna/papers/cidr05.pdf Asilomar, CA, USA CIDR 2005 aurora misc 984 September VLDB Journal Journal Article 2003 120--139 VLDB Journal, 12(2) Aurora: A New Model and Architecture for Data Stream Management This paper describes the basic processing model and architecture of Aurora, a new system to manage data streams for monitoring applications. Monitoring applications differ substantially from conventional business data processing. The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators requires one to rethink the fundamental architecture of a DBMS for this application area. In this paper, we present Aurora, a new DBMS currently under construction at Brandeis University, Brown University, and M.I.T. We first provide an overview of the basic Aurora model and architecture and then describe in detail a stream-oriented set of operators. http://cs-www.cs.yale.edu/homes/dna/papers/vldb095.pdf 2003-09 barton-benchmark techreport 357 Technical Report 2007 Using The Barton Libraries Dataset As An RDF Benchmark This report describes the Barton Libraries RDF dataset and Longwell query benchmark that we use for our recent VLDB paper on Scalable Semantic Web Data Management Using Vertical Partitioning http://cs-www.cs.yale.edu/homes/dna/papers/bench.pdf MIT MIT-CSAIL-TR-2007-036 2007 sensor-stream-integration misc 116 VLDB Demonstration 2004 1361--1364 Demonstration. VLDB An Integration Framework for Sensor Networks and Data Stream Management Systems This demonstration shows an integrated query processing environment where users can seamlessly query both a data stream management system and a sensor network with one query expression. By integrating the two query processing systems, the optimization goals of the sensor network (primarily power) and server network (primarily latency and quality) can be unified into one quality of service metric. http://cs-www.cs.yale.edu/homes/dna/papers/vldb04.pdf Toronto, Canada 2004 aurora-demo misc 151 SIGMOD Demonstration 2003 666-666 Demonstration. SIGMOD Aurora: A Data Stream Management System Streams are continuous data feeds generated by such sources as sensors, satellites, and stock feeds. Monitoring applications track data from numerous streams, filtering them for signs of abnormal activity, and processing them for purposes of filtering, aggregation, reduction, and correlation. Aurora is a general-purpose data stream manager that is being designed and implemented (at Brandeis University, Brown University, and M.I.T.) to efficiently support a variety of real-time monitoring applications. http://cs-www.cs.yale.edu/homes/dna/papers/AuroraDemo.pdf San Diego, CA, USA 2003 vcoko misc 111 SIGMOD Demonstration 2002 617--617 Demonstration. SIGMOD Visual COKO: A Debugger for Query Optimizer Development Query optimization generates plans to retrieve data requested by queries. Query rewriting, which is the first step of this process, rewrites a query expression into an equivalent form to prepare it for plan generation. COKO-KOLA introduced a new approach to query rewriting that enables query rewrites to be formally verified using an automated theorem prover. KOLA is a language for expressing term rewriting rules that can be 'fired' on query expressions. COKO is a language for expressing query rewriting transformations that are too complex to express with simple KOLA rules. COKO is a programming language designed for query optimizer development. Programming languages require debuggers, and in this demonstration, we illustrate our COKO debugger: Visual COKO. Visual COKO enables a query optimization developer to visually trace the execution of a COKO transformation. At every step of the transformation, the developer can view a tree-display that illustrates how the original query expression has evolved. http://cs-www.cs.yale.edu/homes/dna/papers/vcoko.pdf Madison, Wisconsin 2002 abadi-phd misc 1388 Thesis 2008 MIT PhD Dissertation Query Execution in Column-Oriented Database Systems There are two obvious ways to map a two-dimension relational database table onto a one-dimensional storage interface: store the table row-by-row, or store the table column-by-column. Historically, database system implementations and research have focused on the row-by row data layout, since it performs best on the most common application for database systems: business transactional data processing. However, there are a set of emerging applications for database systems for which the row-by-row layout performs poorly. These applications are more analytical in nature, whose goal is to read through the data to gain new insight and use it to drive decision making and planning. In this dissertation, we study the problem of poor performance of row-by-row data layout for these emerging applications, and evaluate the column-by-column data layout opportunity as a solution to this problem. There have been a variety of proposals in the literature for how to build a database system on top of column-by-column layout. These proposals have different levels of implementation effort, and have different performance characteristics. If one wanted to build a new database system that utilizes the column-by-column data layout, it is unclear which proposal to follow. This dissertation provides (to the best of our knowledge) the only detailed study of mutliple implementation approaches of such systems, categorizing the different approaches into three broad categories, and evaluating the tradeoffs between approaches. We conclude that building a query executer specifically designed for the column-by-column query layout is essensial to acheive good performance. Consequently, we describe the implementation of C-Store, a new database system with a storage layer and query executer built for column-by-column data layout. We introduce three new query execution technqiues that significantly improve performance. First, we look at the problem of integrating compression and execution so that the query executer is capable of directly operating on compressed data. This improves performance by improving I/O (less data needs to be read off disk), and CPU (the data need not be decompressed). We describe our solution to the problem of executer extensibility -- how can new compression techniques be added to the system without having to rewrite the operator code? Second, we analyze the problem of tuple construction (stitching together attributes from multiple columns into a row-oriented ``tuple''). Tuple construction is required when operators need to access multiple attributes from the same tuple; however, if done at the wrong point in a query plan, a significant performance penalty is paid. We introduce an analytical model and some heuristics to use that help decide when in a query plan tuple construction should occur. Third, we introduce a new join technique, the ``invisible join'' that improves performance of a specific type of join that is common in the applications for which column-by-column data layout is a good idea. Finally, we benchmark performance of the complete C-Store database system against other column-oriented database system implementation approachs, and against row-oriented databases. We benchmark two applications. The first application is a typical analytical application for which column-by-column data layout is known to outperform row-by-row data layout. The second application is another emerging application, the Semantic Web, for which column-oriented database systems are not currently used. We find that on the first application, the complete C-Store system performed 10 to 18 times faster than alternative column-store implementation approaches, and 6 to 12 times faster than a commercial database system that uses a row-by-row data layout. On the Semantic Web application, we find that C-Store outperforms other state-of-the-art data management techniques by an order of magnitude, and outperforms other common data management technqiues by almost two orders of magnitude. Benchmark queries, which used to take multiple minutes to execute, can now be answered in several seconds. http://cs-www.cs.yale.edu/homes/dna/papers/abadiphd.pdf PhD Thesis 2008 abadi-anaphora misc 164 Thesis 2003 Cambridge University MPhil Dissertation Comparing Domain-Specific and Non-Domain-Specific Anaphora Resolution Techniques Three different pronominal anaphora resolution techniques are examined. The first two techniques compare traditional salience-based approaches when different amounts of syntactic information are available. The improvement in pronoun resolution precision is quantified when a large scale grammar is used to extract detailed syntactic information rather than inferring this information robustly using pattern matching. The third technique uses domain knowledge instead of syntactic information to resolve pronouns. The domain knowledge required for this algorithm can be automatically acquired from a database backend schema representation of the domain. Each of these three techniques is evaluated separately, and then the domain-specific and non-domain-specific algorithms are combined and evaluated. http://cs-www.cs.yale.edu/homes/dna/papers/FinalMPhil.pdf M.Phil. Thesis 2003 abadi-ugrad-thesis misc 132 Thesis 2002 Brandeis University Senior Honors Thesis Visual COKO: A Debugger for Query Optimizer Development Query optimization generates plans to retrieve data requested by queries, and query rewriting (rewriting a query expression into an equivalent form to prepare it for plan generation) is typically the first step. COKO-KOLA introduced a new approach to query rewriting that enables query rewrites to be formally verified using an automated theorem prover. KOLA is a language for expressing rewriting rules that can be fired on query expressions. COKO is a language for expressing query rewriting transformations that are too complex to express with simple KOLA rules. COKO is a programming language designed for query optimizer development. Programming languages require debuggers, and this paper describes a COKO debugger: Visual COKO. Visual COKO enables a query optimization developer to visually trace the execution of a COKO transformation. At every step of the transformation, the developer can view a tree-display that illustrates how the original query expression has evolved. Rule-based query rewriting and the COKO-KOLA project are described for background. Then the COKO syntax is summarized from the point of view of the COKO programmer using an example query transformation that converts query predicates to conjunctive normal form. Visual COKO is described and instructions for its use are presented. Finally, a description of its implementation is given. http://cs-www.cs.yale.edu/homes/dna/papers/VisualCOKOThesis.pdf Undergraduate Thesis 2002 Hatoun Matt Hatoun Cetintemel Ugur Cetintemel Harizopoulos Stavros Harizopoulos Madden Samuel R. Madden Schuler Jorg Schuler Tatbul Nesime Tatbul Lin Amerson Lin Ahmad Yanif Ahmad Convey Christian Convey Abadi Daniel J. Abadi Yan R. Yan Stavros Harizopoulos Stavros Harizopoulos, Daniel J. Abadi, Samuel R. Madden, Cherniack Mitch Cherniack O'Neil Elizabeth J. O'Neil Hollenbach Kate Hollenbach Erwin Christina Erwin Ferreira Miguel Ferreira Galvez Eddie Galvez Batkin Adam Batkin Xing Ying Xing Balazinska Magdalena Balazinska Hwang Jeong-Hyon Hwang Daniel J. Abadi Daniel J. Abadi, Samuel R. Madden, Nabil Hachem Ryvkina Esther Ryvkina Lindner Wolfgang Lindner Stonebraker Michael Stonebraker Zdonik Stan B. Zdonik Lau Edmond Lau Singer A. Singer Carney Don Carney Marcus Adam Marcus Myers Daniel S. Myers Lee Sangdon Lee Chen Xuedong Chen O'Neil Patrick E. O'Neil Helland Pat Helland Rasin Alexander Rasin Tran Nga Tran DeWitt David J. DeWitt Hachem Nabil Hachem Liang Velen Liang Maskey Anurag S. Maskey