Hive provides an SQL-like query language for ETL work on top of the Hadoop file system.
Hive Query Language (HiveQL) provides an SQL-like environment in Hive for working with tables, databases, and queries.
Hive supports different clauses for different kinds of data manipulation and querying. For connectivity with nodes outside the Hive environment, Hive also provides JDBC connectivity.
Hive queries provide the following features:
Creating Table in Hive
Before moving on to the main topic of this tutorial, we first create a table to use as a reference in the sections that follow.
In this tutorial, we create the table “employees_guru” with 6 columns.
[Screenshot: Creating Table in Hive]
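As a rough sketch, the table definition could look like the statement below. Only the Id, Name and Department columns appear in the queries later in this tutorial; the other three columns (Age, Address, Salary), all column types, and the field delimiter are assumptions made for illustration.

```sql
-- Hypothetical definition of the six-column table used in this tutorial.
-- Id, Name and Department are referenced by later queries; Age, Address,
-- Salary, every column type, and the ',' delimiter are assumed.
CREATE TABLE IF NOT EXISTS employees_guru (
  Id         INT,
  Name       STRING,
  Age        INT,
  Address    STRING,
  Salary     FLOAT,
  Department STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```

Running DESCRIBE employees_guru; afterwards lists the columns and their types, which is a quick way to confirm the table was created as intended.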
The ORDER BY syntax in HiveQL is similar to the syntax of ORDER BY in SQL language.
ORDER BY is the clause we use with the “SELECT” statement in Hive queries to sort data. It sorts the result by the values of the column named in the clause, and the query displays the rows in ascending or descending order of that column's values.
If the ORDER BY field is a string, the result is displayed in lexicographical order. At the back end, all the data has to be passed through a single reducer to guarantee a total order.
The following query sorts the rows of employees_guru by the Department column:
SELECT * FROM employees_guru ORDER BY Department;
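A few variations on the query above may help show how the sort direction is controlled. This is a minimal sketch against the same employees_guru table; the LIMIT value of 10 is arbitrary.

```sql
-- ASC is the default, so this is the same as ORDER BY Department.
SELECT * FROM employees_guru ORDER BY Department ASC;

-- Reverse lexicographical order on the string column.
SELECT * FROM employees_guru ORDER BY Department DESC;

-- Sort by Department first, then by Id (descending) within each department.
SELECT * FROM employees_guru ORDER BY Department ASC, Id DESC;

-- Return only the first 10 rows of the globally sorted result.
SELECT * FROM employees_guru ORDER BY Department LIMIT 10;
```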
The GROUP BY clause groups the rows of a Hive table by the values of the columns named in the clause. The query returns one result row for each distinct value of the GROUP BY column.
For example, the query below displays the total count of employees in each department. Here “Department” is the GROUP BY column:
SELECT Department, count(*) FROM employees_guru GROUP BY Department;
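The same grouping can be written with an alias for the aggregate, and the groups can be filtered after aggregation. The sketch below assumes the same employees_guru table; the alias employee_count and the threshold of 1 are illustrative choices, not part of the original tutorial.

```sql
-- Same count per department, with a readable name for the aggregate column.
SELECT Department, count(*) AS employee_count
FROM employees_guru
GROUP BY Department;

-- Keep only the departments that have more than one employee.
SELECT Department, count(*) AS employee_count
FROM employees_guru
GROUP BY Department
HAVING count(*) > 1;
```

Note that every column in the SELECT list that is not inside an aggregate function, such as Department here, also has to appear in the GROUP BY clause.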
The SORT BY clause sorts the output on the named columns of a Hive table. We can specify DESC to sort in descending order and ASC to sort in ascending order.
SORT BY sorts the rows before they are fed to a reducer, so each reducer's output is sorted, but the overall output is not guaranteed to be in total order when there is more than one reducer. The sort order always depends on the column types.
For instance, if the column type is numeric, the rows are sorted in numeric order; if the column type is string, they are sorted in lexicographical order.
The following query sorts the rows of employees_guru by Id in descending order:
SELECT * FROM employees_guru SORT BY Id DESC;
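To see the difference between SORT BY and ORDER BY, it can help to run both with more than one reducer. The sketch below assumes the MapReduce execution engine, where the mapreduce.job.reduces property requests a number of reducers; the value 2 is arbitrary.

```sql
-- Assumed setting: ask for 2 reducers so the per-reducer ordering is visible.
SET mapreduce.job.reduces=2;

-- Each reducer sorts its own share of the rows by Id in descending order.
-- The combined output is not guaranteed to be in one global order.
SELECT * FROM employees_guru SORT BY Id DESC;

-- For comparison: ORDER BY produces one global order through a single reducer.
SELECT * FROM employees_guru ORDER BY Id DESC;
```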
CLUSTER BY is used as a shorthand for both the DISTRIBUTE BY and SORT BY clauses in HiveQL.
The CLUSTER BY clause is used on tables in Hive. Hive uses the columns in CLUSTER BY to distribute the rows among reducers, so the rows are spread across multiple reducers, and each reducer sorts its rows on the same columns.
For example, take the CLUSTER BY clause on the Id column of the employees_guru table. Executing this query sends the rows to multiple reducers at the back end, while at the front end it is an alternative to using both SORT BY and DISTRIBUTE BY.
This is the back-end process, in terms of the MapReduce framework, when we run a query with SORT BY, GROUP BY, or CLUSTER BY. So if we want the results to be processed by multiple reducers, we use CLUSTER BY.
The following query selects Id and Name from employees_guru and clusters the rows by Id:
SELECT Id, Name FROM employees_guru CLUSTER BY Id;
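Since CLUSTER BY stands in for DISTRIBUTE BY plus SORT BY on the same columns, the query can also be written in its long form. The two statements below are a sketch of that equivalence on the same table.

```sql
-- Shorthand form used in this tutorial.
SELECT Id, Name FROM employees_guru CLUSTER BY Id;

-- Long form: distribute rows to reducers on Id,
-- then sort each reducer's rows by Id.
SELECT Id, Name FROM employees_guru DISTRIBUTE BY Id SORT BY Id;
```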
The DISTRIBUTE BY clause is used on tables in Hive. Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers. All rows with the same DISTRIBUTE BY column values go to the same reducer. DISTRIBUTE BY alone does not guarantee any sort order within a reducer.
The following query distributes the rows of employees_guru among reducers by Id:
SELECT Id, Name FROM employees_guru DISTRIBUTE BY Id;
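DISTRIBUTE BY is often paired with SORT BY when the distribution column and the sort columns differ, which is the case CLUSTER BY cannot express. The following is a minimal sketch on the same table; distributing on Department rather than Id is a choice made here for illustration.

```sql
-- Send all rows of a department to the same reducer, then sort each
-- reducer's rows by Department and, within a department, by Id descending.
SELECT Id, Name, Department
FROM employees_guru
DISTRIBUTE BY Department
SORT BY Department ASC, Id DESC;
```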