Shark 0.9.1 is a maintenance release that stabilizes 0.9.0, which raised Scala compatibility to 2.10.3 and Hive compliance to 0.11. The core dependencies for this version are:
Scala 2.10.3
Spark 0.9.1
AMPLab’s Hive 0.9.0
(Optional) Tachyon 0.4.1
Hive Compatibility
We’ve extensively upgraded the Shark codebase to be Hive 0.11 compliant. Existing users can now launch Shark as a drop-in replacement that operates on existing Hive 0.11 metastores.
Two major components added during this upgrade process are support for new windowing and analytics functions, and SharkServer2. More detail is available in the respective sections below.
Analytics Functions
Windowing functions
Shark now supports the windowing functions added by HIVE-896. All of the supported window functions follow SQL-standard semantics.
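To illustrate what a window function computes, the following Python sketch emulates RANK() OVER (PARTITION BY dept ORDER BY salary DESC) on a small in-memory list. The column names, the sample rows, and the helper rank_over are hypothetical and not part of Shark. The sketch only mirrors the SQL-standard behavior, in which tied rows share a rank and the next distinct value skips ahead.

```python
from itertools import groupby
from operator import itemgetter


def rank_over(rows, partition_key, order_key):
    """Emulate RANK() OVER (PARTITION BY partition_key ORDER BY order_key DESC)."""
    out = []
    # Sort by partition so groupby sees each partition as one contiguous run.
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, part in groupby(ordered, key=itemgetter(partition_key)):
        # Within a partition, order rows by the ranking column, highest first.
        part = sorted(part, key=itemgetter(order_key), reverse=True)
        rank, prev = 0, object()  # sentinel that never equals a real value
        for position, row in enumerate(part, start=1):
            if row[order_key] != prev:
                # A new value takes its 1-based position as rank; ties keep the old rank.
                rank = position
                prev = row[order_key]
            out.append({**row, "rank": rank})
    return out


# Hypothetical sample data.
employees = [
    {"dept": "a", "salary": 10},
    {"dept": "a", "salary": 20},
    {"dept": "a", "salary": 20},
    {"dept": "b", "salary": 5},
]
ranked = rank_over(employees, "dept", "salary")
```

Unlike a GROUP BY, the window function keeps every input row and attaches the computed value to it, which is why ranked has as many entries as employees.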
Rollups
Shark also supports enhanced aggregation in the form of rollups. This feature allows users to compute aggregations over multiple groups easily and efficiently. For example, the following query uses the new GROUPING SETS clause:
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a)
The above query is equivalent to running multiple aggregations as follows:
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION ALL
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
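The equivalence between the GROUPING SETS form and the UNION ALL form can be checked on toy data. The Python sketch below is an illustration only: the rows in tab1 and the helpers group_sum and grouping_sets are hypothetical, and the code says nothing about how Shark executes the query internally.

```python
from collections import defaultdict


def group_sum(rows, keys):
    """SUM(c) grouped by `keys`; a or b becomes None (SQL NULL) when not in `keys`."""
    totals = defaultdict(int)
    for row in rows:
        group = tuple(row[k] if k in keys else None for k in ("a", "b"))
        totals[group] += row["c"]
    return [(a, b, total) for (a, b), total in totals.items()]


def grouping_sets(rows, sets):
    """Concatenate one aggregation per grouping set, like the UNION ALL form."""
    result = []
    for keys in sets:
        result.extend(group_sum(rows, keys))
    return result


# Hypothetical contents of tab1.
tab1 = [
    {"a": 1, "b": "x", "c": 10},
    {"a": 1, "b": "y", "c": 5},
    {"a": 2, "b": "x", "c": 7},
]

# GROUP BY a, b GROUPING SETS ( (a,b), a )
result = grouping_sets(tab1, [("a", "b"), ("a",)])
```

Here result holds the three per-(a, b) sums followed by the two per-a sums with b set to None, which matches the rows the two-branch UNION ALL query would return.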
SharkServer2
SharkServer2 is an improved Thrift server that’s compatible with the HiveServer2 developed in Hive 0.11. SharkServer2 allows for hosting concurrent client connections and query executions. Semantics are the same as for HiveServer2:
To start a SharkServer2:
$ bin/shark -service sharkserver2
To connect to the server from remote clients, you can use JDBC with the network address and port that the server is listening on. For example, to use the Beeline CLI:
<table name>_cached now caches the table in the MEMORY_ONLY ephemeral layer (Spark block manager), which is consistent with pre-0.8.0 behavior. Previously, Shark was using MEMORY, which incurs added latency in DDL commands due to writes to both persistent and ephemeral storage.
CACHE <table name> IN <cache type> can be used to specify the cache layer for a table. This is equivalent to ALTER TABLE <table name> TBLPROPERTIES('shark.cache'='<cache type>'). <cache type> can be MEMORY, MEMORY_ONLY, or TACHYON.
Maven Central and Easier Deployment
To simplify deployment and installation, we’ve uploaded all AMPLab Hive and Shark binaries to Maven Central under the edu.berkeley.cs.shark organization. HIVE_HOME is now obsolete, and Hive binary downloads are no longer required to begin running Shark. Instead, simply download the Shark binaries, and execute SHARK_HOME/bin/shark.
To include Shark as a dependency in your application:
For an sbt build file:
Delta encoding for int and long primitives stored in columnar format. To save memory, we only store differences between consecutive values in each int or long column.
Table scans over Hive-partitioned tables (i.e., tables created using the PARTITIONED BY clause) now broadcast a single configuration for each table scan, rather than a number of broadcasts that grows linearly with the number of partitions in that table.
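The delta encoding item above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not Shark's implementation: the function names and the sample column are hypothetical, and the byte-level layout Shark uses for the stored differences is not described in these notes.

```python
def delta_encode(values):
    """Keep the first value, then store each value as its difference from the previous one."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original column with a running sum over the deltas."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


# Hypothetical int column with slowly changing values.
column = [1000, 1001, 1003, 1003, 1010]
encoded = delta_encode(column)
decoded = delta_decode(encoded)
```

For a column like this one, every stored difference after the first entry is a small number, which is the property a compact encoding can exploit.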