Leveraging your Db2 Skills with Big Data

The Challenge

The idea of the traditional data center being centered on relational database technology is quickly evolving.  Many new data sources exist today that did not exist as little as 5 years ago.  Devices such as active machine sensors on machinery, autos and aircraft, medical sensors, RFIDs as well as social media, and web click through activity are creating tremendous volumes of mostly unstructured data which cannot be possibly stored or analyzed in traditional RDMS’s. 

These new data sources are pushing companies to explore the concepts of Big Data and Hadoop architecture which is creating a new set of problems for corporate IT.  Hadoop development and administration can be complicated and time consuming.  Developing the complex MapReduce  programs to mine this data is complicated and a very specialized skill.  Companies need to invest in training their existing personnel or hire people specializing in MapReduce programming and administration. This is the very reason many enterprises have been hesitant to invest in big data applications.

 

Leveraging Existing Skills

A solution to this problem is to be able to leverage existing SQL skills for analyzing Hadoop data.  Apache Hive was the original solution to this problem.  It provides an open source SQL interface to Hadoop.  This allows a person with basic SQL skills to run analytics on Hadoop data using a SQL language called HiveQL without the complexity of MapReduce. 

Apache Hive
So, what exactly is Hive? Apache Hive is a data warehouse system for Hadoop. Hive is not a database, it is a metastore which holds the table structure definitions that you define when you create a hive table.  This repository is known as the HCatalog and is actually a relational database.  The RDBMS is typically MySQL, Postgres, or Oracle.   It is important to understand that Hive does not provide an OLTP type capability to Hadoop Queries.  HiveQL will be translated into MapReduce jobs under the covers.  It is really best suited for long running batch type queries due to the performance limitations of this process.

Hive is essentially 3 Things:

  1. MapReduce execution engine
  2. Storage model
  3. Metastore

 

Hive tables can be partitioned to help improve performance:


 

Db2 Big SQL

IBM’s Db2 Big SQL takes the Hive SQL capability to a higher level.  Big SQL is based on Db2 MPP architecture and sits on top of Hive.  Big SQL replaces MapReduce with MPP architecture which is much faster and provides high concurrency enabling a much closer representation to an OLTP experience.

For a person with Db2 background Big SQL makes Hadoop data easily accessible.  It has fully ANSI compliant SQL plus the syntax for DDL and DML are nearly identical to native Db2 as well.

Here is an example of a create table statement:

create hadoop table users

(

  id             int                   not null primary key,

  office_id int                  null,

  fname     varchar(30)   not null,

  lname     varchar(30)   not null)

row format delimited

 fields terminated by '|'

stored as textfile;

Notice the use of “not null” and “primary key” in the definitions.   This syntax is unique to Big SQL.  These keywords are not actually enforced in Hadoop but because it is Db2 at its core the information is fed to the optimizer and will undergo query rewrite and optimization very similar to Db2.   Defining RI relationships in these tables allows the optimizer to be more intelligent about join order just like Db2.  Once this DDL is executed, the metadata will be stored in the Hive HCatalog just the same as native Hive tables.

So now you may want to create a view.  Here again the syntax is identical to Db2:

create view my_users as

select fname, lname from myschema.users where id > 100; 

Other features that should look familiar to you:

  1. “Native Tables” with full transactional support on the Head Node:
    • Row oriented, traditional DB2 tables
    • Materialized Query Tables
  2. GET SNAPSHOT / snapshot table functions
  3. RUNSTATS command (db2) à ANALYZE command (Big SQL)
  4. Row and Column Security
  5. Federation / Fluid Query
  6. Views
  7. SQL PL Stored Procedures & UDFs
  8. Workload Manager
  9. System Temporary Table Spaces to support sort overflows
  10. User Temporary Table Spaces for Declared Global Temporary Tables
  11. HADR for Head Node
  12. Oracle PL/SQL support
  13. Declared Global Temporary Tables
  14. Time Travel Queries

 

Conclusion

Currently there are a myriad of  SQL engines for Hadoop available.  Different engines solve different problems.  Most likely no single SQL engine will address all your modern data warehousing needs or use cases.  Depending on how your organization is planning on using Hadoop you will most likely use a combination of SQL engines.  For long running batch queries you may want to use native Hive, for simple ad-hoc queries you may use native Apache Spark SQL, and for complex BI type of queries Big SQL fills the bill. 

 

So if your company is starting to dabble with Hadoop or has full blown production clusters these SQL engines can help you leverage your Db2 skills to start working with Hadoop.   Hopefully this short article will help you kick start your exploration of Big Data!

 

If you would like to get some hands-on experience with Db2 Big SQL please visit IBM’s new demo page: Db2 Big SQL.  You can see demo videos, go through a click-through guided demo, and get actual hands on experience by requesting a live Big SQL/HDP cluster in IBM cloud.  This is a great way to get some experience with Hadoop, Hive,  and  IBM Db2 Big SQL without having to download or install anything.

Views: 470

Add a Comment

You need to be a member of The World of DB2 to add comments!

Join The World of DB2

Comment by blon sin on April 19, 2021 at 14:47

useful

Latest Activity

Jim Reed liked Annette Zawacki's discussion Is your Db2 for z/OS data part of your AI strategy? Learn how you can build AI models anywhere and deploy on IBM Z
yesterday
Jim Reed liked Diego Cardalliaguet's event EMEA IBM Db2 Analytics Accelerator Users Group meeting 2021
yesterday
Jim Reed liked Diego Cardalliaguet's event EMEA IBM Db2 Analytics Accelerator Users Group meeting 2021
yesterday
Jim Reed liked Surekha Parekh's event Achieving Optimal Availability, Usability and Performance with Universal Table Space - Part 2
yesterday
Profile Icon via Twitter
Join the live launch webcast, "What’s New in Db2 AI for z/OS and Watson Machine Learning for z/OS" June 1 at 11 AM… https://t.co/yvz9t7dwcC
TwitterTuesday · Reply · Retweet
A blog post by Surekha Parekh was featured
Tuesday
Diego Cardalliaguet updated an event
Thumbnail

EMEA IBM Db2 Analytics Accelerator Users Group meeting 2021 at Video conference

May 18, 2021 from 10am to 4:15pm
Monday
Annette Zawacki added a discussion to the group Data, Analytics and AI on IBM Z
Saturday
Joe Geller posted an event

TRIDEX Db2 for LUW Q2 2021 Meeting at Virtual Meeting

May 13, 2021 from 9am to 1pm
May 6
Tom Glaser posted events
May 6
Annette Zawacki posted an event
Thumbnail

Is your Db2 for z/OS data part of your AI strategy? at Eastern Time (U.S. and Canada)

May 18, 2021 from 11am to 12pm
May 5
Profile Icon via Twitter
Join “The IBM Enterprise Data & AI Virtual Symposium 2021”https://t.co/uRxOVwMZu6 hear directly from our clients, R… https://t.co/SEGx6axQYf
TwitterApr 27 · Reply · Retweet
Profile Icon via Twitter
Live Now! IBM Enterprise Data & AI Virtual Symposium 2021. Welcome to Ibrahim Parlak from Garanti Bank, over 16 Mil… https://t.co/CtVD3yiMn7
TwitterApr 27 · Reply · Retweet
Profile Icon via Twitter
Join now Panel started “The IBM Enterprise Data & AI Virtual Symposium 2021” https://t.co/uRxOVwMZu6 hear directly… https://t.co/7WN1BLMV4Y
TwitterApr 27 · Reply · Retweet
Profile Icon via Twitter
Join now Panel started “The IBM Enterprise Data & AI Virtual Symposium 2021” https://t.co/uRxOVx4AlE hear directly… https://t.co/7KZmbGpt1O
TwitterApr 27 · Reply · Retweet
Profile Icon via Twitter
Join Now "The road to AI-first enterprises" The IBM Enterprise Data & AI Virtual Symposium. REGISTER NOW… https://t.co/woTAl9ExIQ
TwitterApr 27 · Reply · Retweet

Bringing Db2 enthusiasts together virtually. Expert or novice, distributed or mainframe, this is the place for everything DB2.

Forum

who opened the scoring from the penalty spot

Started by long lin in What's hot ? Apr 22. 0 Replies

fotbollströja barn eget tryck Lionel Messi opened the scoring from the penalty spot, but that was as good as it got for Ronald Koeman's side as Mbappe took over to evoke memories of recent defeats to…Continue

Fact.MR redefines the way

Started by blon sin in What's hot ? Apr 21. 0 Replies

New York, April 20, 2021 - The primary reason for the failure of startups is not lack of funding or intense competition. The main reason is "misunderstanding of market demand."Therefore, for start-ups, it is important to thoroughly study the…Continue

© 2021   Created by Surekha Parekh.   Powered by

Badges  |  Report an Issue  |  Terms of Service