Shark 0.9.1

Release date: April 10, 2014

Shark 0.9.1 is a maintenance release that stabilizes 0.9.0, which bumps up Scala compatibility to 2.10.3 and Hive compliance to 0.11. The core dependencies for this version are:

- Scala 2.10.3
- Spark 0.9.1
- AMPLab’s Hive 0.9.0
- (Optional) Tachyon 0.4.1

### Hive Compatibility

We’ve extensively upgraded the Shark codebase to be Hive 0.11 compliant. Existing users can now launch Shark as a drop-in replacement for operating with existing Hive 0.11 metastores. Two major components added during this upgrade process are support for new windowing and analytics functions, and SharkServer2. More detail is available in the respective sections below.

### Analytics Functions

#### Windowing functions

Shark now supports the windowing functions added by [HIVE-896](https://issues.apache.org/jira/browse/HIVE-896). All of the supported window functions operate based on the SQL standard.

#### Rollups

Shark also supports enhanced aggregation in the form of rollups. This feature allows users to compute aggregations over multiple groups easily and efficiently. For example, the following query uses the new `GROUPING SETS` clause:

``` sql
SELECT a, b, SUM( c )
FROM tab1
GROUP BY a, b
GROUPING SETS ( (a,b), a)
```

The above query is equivalent to running multiple aggregations as follows:

``` sql
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION ALL
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
```

### SharkServer2

SharkServer2 is an improved Thrift server that’s compatible with the HiveServer2 developed in Hive 0.11. SharkServer2 allows for hosting concurrent client connections and query executions. Semantics are the same as for HiveServer2.

To start a SharkServer2:

```
$ bin/shark -service sharkserver2
```

To connect to the server from remote clients, you can use JDBC with the network address and port that the server is listening on.
For example, to use the Beeline CLI:

```
$ bin/beeline
beeline> !connect jdbc:hive2://localhost:10000/default
```

### Usability

- `<table_name>_cached` now caches the table in the `MEMORY_ONLY` ephemeral layer (Spark block manager), which is consistent with pre-0.8.0 behavior. Previously, Shark was using `MEMORY`, which incurs added latency in DDL commands due to writes to both persistent and ephemeral storage.
- `CACHE <table_name> IN <cache_layer>` can be used to specify the cache layer for a table. This is equivalent to `ALTER TABLE <table_name> TBLPROPERTIES('shark.cache'='<cache_layer>')`. `<cache_layer>` can be `MEMORY`, `MEMORY_ONLY`, or `TACHYON`.

### Maven Central and Easier Deployment

To simplify deployment and installation, we’ve uploaded all AMPLab Hive and Shark binaries to Maven Central under the `edu.berkeley.cs.shark` organization. `HIVE_HOME` is now obsolete, and Hive binary downloads are no longer required to begin running Shark. Instead, simply download the Shark binaries and execute `SHARK_HOME/bin/shark`.

To include Shark as a dependency in your application:

For an sbt build file:

```
libraryDependencies ++= Seq("edu.berkeley.cs.shark" %% "shark" % "0.9.1")
```

For Maven, in the `dependencies` section in `pom.xml`:

```
<dependency>
  <groupId>edu.berkeley.cs.shark</groupId>
  <artifactId>shark</artifactId>
  <version>0.9.1</version>
</dependency>
```

### Query Execution and Performance Improvements

- Delta encoding for `int` and `long` primitives stored in columnar format. To save memory, we only store differences between consecutive values in each `int` or `long` column.
- Table scans over Hive-partitioned tables (i.e., tables created using a `PARTITIONED BY` clause) now broadcast a single configuration for each table scan, as opposed to a number of broadcasts linear in the number of partitions for that table.
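The delta-encoding idea described above can be illustrated with a small sketch. This is not Shark's implementation (which lives in its columnar storage code and is not shown in these notes); the function names `delta_encode` and `delta_decode` are hypothetical and chosen only for illustration. The sketch keeps the first value of a column and then stores each later value as its difference from the previous one.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value as the difference
    from its predecessor. Hypothetical helper, not a Shark API."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original column by taking a running sum of the deltas."""
    values = []
    running = 0
    for d in deltas:
        running += d
        values.append(running)
    return values


column = [1000, 1001, 1003, 1003, 1010]
encoded = delta_encode(column)
print(encoded)  # [1000, 1, 2, 0, 7]
assert delta_decode(encoded) == column
```

When consecutive values are close together, as in sorted IDs or timestamps, the differences are small numbers that a columnar store can typically represent in fewer bytes than the full `int` or `long` values.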
### Download Links

- [Shark with Hadoop 1](https://s3.amazonaws.com/spark-related-packages/shark-0.9.1-bin-hadoop1.tgz)
- [Shark with Hadoop 2 (cdh5)](https://s3.amazonaws.com/spark-related-packages/shark-0.9.1-bin-hadoop2.tgz)

### Credits

- Michael Armbrust - SharkServer bugfix, Scala 2.10 upgrade
- Oleg Danilov - Hive 0.11 upgrade, bug fixes
- Aaron Davidson - Tachyon API revamp, improved caching semantics
- Harvey Feng - Hive 0.11, Spark 0.9 upgrade, release manager
- Cheng Hao - Windowing functions, join refactor
- Nandu Jayakumar - Delta encoding
- Andy Konwinski - Build script fix
- Steven Leung - Bug fix for partitioned table stats
- ChengXiang Li - Yarn compatibility
- Antonio Lupher - Hive 0.11 upgrade, lateral view improvements
- Sundeep Narravula - Job cancellation using JDBC
- Brian O’Neill - Build fix
- Kay Ousterhout - Improved logging messages
- Ahir Reddy - Python support
- Sun Rui - Testing, analytic function support
- Sergey Soldatov - Hive 0.11 upgrade, serialization bug fix
- Henry Wang - SharkServer2 addition
- Reynold Xin - SparkConf integration
- Tian Yi - Combiner bug fix
- Yury Yudin - Hive 0.11 support

Thanks to everyone who contributed!
