Hurry time is running out register for this live webcast including client interview Register now
Hurry time is running out register for this live webcast including client interview Register now
Don't miss this live webcast on 2nd November 2021 - 11:00AM EST
Tom Ramey will highlight some of the key challenges facing Db2 for z/OS clients and how AI is a breakthrough technology, that when applied to Db2 for z/OS performance management and resiliency can have a major impact. Tom will be joined by Benny Van Straten from Rabobank and Tom Beavin from IBM. Tom Beavin will share Db2 AI use cases and host a live Db2 AI demo.Rabobank is a Dutch multinational banking and financial services company; Benny will share first-hand experiences and lessons learnt around Db2 AI for z/OS and the power of in-database AI.
What will you learn by attending this webcast?
Tom Ramey IBM WW Director, Data and AI on IBM Z
Benny van Straten IT Specialist/DB2 Rabobank
Akiko Hoshikawa IBM Distinguished Engineer
Tom Beavin IBM Db2 AI for z/OS Development Machine Learning and Optimization
I recently wrote about the most common mistakes that are made with respect to RUNSTATS collection in DB2 for z/OS. [[LINK: http://destinationz.org/Mainframe-Solution/Systems-Administration/Make-No-Mistake ]] The responses were positive, but looking into this further, I see that these mistakes are being made more frequently than first thought.
Part of my focus has been on “what NOT to collect,” whereas some customers have been asking for clarification on “what to collect.” Why am I focusing on the opposite of what customers are asking?
The primary reason is because DB2 11 for z/OS (available since 10/2013) already provides recommendations in the catalog table SYSIBM.SYSTATFEEDBACK or explain table DSN_STAT_FEEDBACK on what statistics may benefit your workload. This is based upon query explains, static BIND/REBIND or dynamic SQL prepares, and is therefore specific to your applications. It’s not possible to determine a standard set of RUNSTATs options for all tables and all workloads without knowing what SQL statements are being executed. That’s the benefit of this DB2 11 for z/OS enhancement: it targets the recommendations based upon your SQL.
There’s some work in DB2 11 that’s required to convert the DB2 statistics recommendations into actual RUNSTATS commands. Although the longer goal is for DB2 to simplify this processing.
Therefore, DB2 already has an answer to the question of “what to collect.” Now I will answer “why” I care about “what NOT to collect.”
Identifying Unnecessary Statistics
There’s no process in DB2 today to identify existing statistics that are unnecessary for the optimizer. Additional statistics may increase RUNSTATS collection cost, add extra rows to the catalog and may also increase BIND/prepare time since the optimizer must read in all available statistics to determine what is and isn’t useful for each query.
There are two categories of statistics to call out:
Multi-column FREQVAL: Commonly collected from an index using FREQVAL NUMCOLS n COUNT n
Previously collected statistics that were never re-collected: Often a once-off collection of histogram or other COLGROUP statistics that were trialed to resolve an issue, but never recollected or removed.
Multi-column frequency statistics collected via an index are simple and cheap to collect as a default due to the simplicity of the syntax and efficiency of collection via the index. However, multi-column frequencies are only exploited by the optimizer for equal predicates containing literal values—e.g., if a multi-column frequency is collected on COL1, COL2, COL3, then this would only be utilized by the optimizer if there is a WHERE clause that has equals (‘=’) on all three columns. Given that the majority of static SQL utilize host variables, and dynamic SQL use parameter markers, then the optimizer is unlikely to take advantage of frequency statistics for those. While it’s possible that multi-column frequencies may provide value for some SQL, this likelihood decreases as the number of columns being collected increases, meaning that a two-column frequency is more likely to be exploited than a three-column frequency. Those frequencies on four, five or 10 columns are unlikely to have any SQL that will exploit these.
Regarding “old” statistics: these typically remain because they were initially collected as a trial or a mistake, and it’s misunderstood that they aren’t subsequently removed or overwritten. Maybe you heard that DB2 9 delivered histogram statistics and decided to collect them, but didn’t see any improvement and stopped collecting them. The risk is that these statistics become stale over time, and thus can negatively impact the optimizer by misrepresenting the data at the current point in time.
Since DB2 11 takes steps to automate and simplify what statistics to collect, it’s important to note that this is first built on a foundation of statistics that already exist. This enhancement integrates with the RUNSTATS profiles initially delivered in DB2 10, and the first step to exploitation is typically to execute RUNSTATS with SET PROFILE FROM EXISTING STATS. The problem, as implied above, is that this will inherit all of the unnecessary statistics that currently exist, whether you wanted those statistics or not.
How do you know if you are currently collecting multi-column frequency statistics unnecessarily? It’s easy to determine that you are collecting multi-column frequencies, but not as easy to determine if there are queries that will benefit. But once again, it’s not common that multi-column frequencies are exploited.
The following SQL will identify if you have multi-column frequencies, and also whether those statistics show that the data is skewed (identified by column TOTAL_SKEW) or evenly distributed (TOTAL_EVEN). Why is this interesting? There isn’t value to any frequency statistics that show the data to be evenly distributed, since that is what the optimizer will assume if those statistics were not collected. Therefore, if there are many multi-column frequencies being collected that count as “TOTAL_EVEN” in the below SQL, then you are certainly collecting these unnecessarily.
SELECT NUMCOLUMNS, SUM(TOTAL_EVEN) AS TOTAL_EVEN
,SUM(TOTAL_SKEW) AS TOTAL_SKEW
,SUM(CASE WHEN 1.1/CD2.CARDF >= CD1.FREQUENCYF THEN 1 ELSE 0 END)
,SUM(CASE WHEN 1.1/CD2.CARDF >= CD1.FREQUENCYF THEN 0 ELSE 1 END)
FROM SYSIBM.SYSCOLDIST CD1, SYSIBM.SYSCOLDIST CD2
WHERE CD1.NUMCOLUMNS > 1
AND CD1.TYPE = 'F'
AND CD2.TBOWNER = CD1.TBOWNER
AND CD2.TBNAME = CD1.TBNAME
AND CD2.TYPE = 'C'
AND CD2.NUMCOLUMNS = CD1.NUMCOLUMNS
AND CD2.COLGROUPCOLNO = CD1.COLGROUPCOLNO
GROUP BY CD1.NUMCOLUMNS, CD1.COLGROUPCOLNO) AS X
GROUP BY NUMCOLUMNS
ORDER BY NUMCOLUMNS
The second category I have identified are old statistics, and again there is an SQL below that will allow you to identify these. The target is any histogram, multi-column cardinality (COLGROUP) or multi-column frequencies. I don’t look for single column frequencies only because some customers control collection of these separately from regular statistics collection if the frequencies are not changing over time. The below SQL will find statistics (other than single column frequencies) that don’t correspond to the most recent tablespace or index RUNSTATS execution.
SELECT TYPE, NUMCOLUMNS, TBOWNER, TBNAME, NAME
, MIN(STATSTIME), COUNT(*)
FROM SYSIBM.SYSCOLDIST CD
WHERE STATSTIME < CURRENT TIMESTAMP - 1 MONTH
AND (TYPE IN ('C', 'H') OR NUMCOLUMNS > 1)
AND NOT EXISTS
FROM SYSIBM.SYSINDEXES I
WHERE I.TBCREATOR = CD.TBOWNER
AND I.TBNAME = CD.TBNAME
AND I.STATSTIME = CD.STATSTIME)
AND NOT EXISTS
FROM SYSIBM.SYSTABLES T
WHERE T.CREATOR = CD.TBOWNER
AND T.NAME = CD.TBNAME
AND T.STATSTIME = CD.STATSTIME)
GROUP BY TYPE, NUMCOLUMNS, TBOWNER, TBNAME, NAME
ORDER BY TYPE, NUMCOLUMNS, TBOWNER, TBNAME, NAME
Reduce the Unnecessary
Multi-column frequencies are often being collected by default, but typically provide minimal value to the optimizer. The first SQL above helps identify if your environment has multi-column frequencies, and whether those statistics show that the data is skewed or evenly distributed. If you find that these statistics exist, and this data is evenly distributed, then the decision is easy for stopping collection of these. If the data is skewed, then the decision is not as clear.
However, I would again argue that the optimizer exploits such statistics in minimal scenarios, and this decreases as the number of columns increase. If you’re unsure as to whether you can remove these, then I would suggest removing any frequencies for greater than three columns, and revisit this if any access path regressions surface. If you choose not to continue collection, then it’s important to remove the statistics that you will not recollect (e.g., to remove a five-column frequency, you can use NUMCOLS 5 COUNT 0 in the appropriate RUNSTATS job, or issue a DELETE FROM SYSIBM.SYSCOLDIST with an appropriate WHERE clause to ensure the correct statistics are removed).
Removing old/stale statistics is also critical, and while frequency statistics can be removed by specifying COUNT 0, histogram or COLGROUP (TYPE=’C’) rows must be deleted.
Remember, the goal here is to reduce unnecessary statistics. Otherwise DB2’s exploitation of RUNSTATS profiles will inherit these additional statistics that can increase RUNSTATS collection cost and also impact BIND/PREPARE performance.
Terry Purcell is the lead designer for the DB2 for z/OS Optimizer at IBM's Silicon Valley Lab and is recognized worldwide for his expertise in query optimization and performance with DB2 for z/OS.
Always get the latest news about Db2 for z/OS from the IBM lab! How to subscribe
The DB2 optimizer exploits the catalog information to choose an access path for a given SQL statement, and RUNSTATS is what populates the catalog. It is important therefore to collect a good base of statistics to give the optimizer a fighting chance of choosing well-performing access paths for your SQL.
Here we will discuss some common mistakes that are made when running RUNSTATS.
Mistake #1 – Collecting statistics on indexes and tablespaces at different times.
It is common to collect inline statistics as part of a REORG. When a tablespace is REORG’d, all indexes are also rebuilt, and thus statistics collection will establish a consistent view of the tablespace and its associated indexes. However, Index REORGs may also be scheduled separately to rebuild a disorganized index. For index REORGs, collecting inline statistics can result in inconsistencies if data volumes have changed since the last RUNSTATS on the tablespace or other indexes for that table. Such inconsistencies could impact the access path choice since the optimizer does not attempt to resolve statistics conflicts between related objects.
It is therefore recommended to only execute RUNSTATS (or inline stats) on the tablespace and associated indexes at the same time, and not to collect only on an individual index during REORG.
Mistake #2 – Misunderstanding NUMCOLS specification for index RUNSTATS.
I often see the following usage of NUMCOLS as part of RUNSTATS on an index:
INDEX(ALL) FREQVAL NUMCOLS 5 COUNT 10
This will collect the top 10 most frequently occurring values for the combination of the 1st 5 columns of an index. The mistake however is the assumption that RUNSTATS will cascade the collection of all combinations – such as frequencies on the 1st 4 columns, the 1st 3 columns, the 1st 2 columns, and finally the 1st column. This is incorrect, as you only get what you ask for with this specification. You ask for 5 columns – you get 5 columns.
The solution is to specify NUMCOLS 4, then NUMCOLS3, NUMCOLS2 and NUMCOLS 1. However, this takes us to the next mistake – where such statistics are rarely used.
Mistake #3 – Excessive NUMCOLS collection as a default
For those who understand the syntax for multi-column frequencies, they often use the following default:
RUNSTATS TABLE (ALL) INDEX (ALL) KEYCARD
FREQVAL NUMCOLS 1 COUNT 10
FREQVAL NUMCOLS 2 COUNT 10
FREQVAL NUMCOLS 3 COUNT 10
FREQVAL NUMCOLS 4 COUNT 10
FREQVAL NUMCOLS 5 COUNT 10
This is itself a mistake – simply because multi-column frequency statistics are rarely useful, and thus default collection for all indexes is excessive. While these are efficient for RUNSTATS to collect, they add unnecessary rows to the catalog (SYSIBM.SYSCOLDIST), and increase bind/prepare time since the optimizer must read them in and process these.
For the optimizer to take advantage of multi-column frequencies, the query needs equals predicates with literals on all columns in that group. So the index with COL1, COL2, COL3, COL4, COL5 – you need 5 equals predicates, all with literals (no host variables or parameter markers). This is not common.
There will be some situations where multi-column frequencies are beneficial – but it is rarely worth collecting multi-column frequencies by default and instead collect them if an individual query is identified where they will benefit. If you choose to stop collecting multi-column frequencies, then read on to the next mistake.
Mistake #4 Failing to clean up or recollect “one-off” statistics
When additional frequency, histogram or other colgroup statistics are collected, it is common to cease collection if there was no observed benefit. The question then is – did you remove them? Unfortunately the answer is often NO.
When RUNSTATS executes, it only replaces existing statistics with new values – it does not clear out other statistics collected on that tablespace or index. For example, if you collect NUMCOLS 2 COUNT 10, but the next RUNSTATS execution does NOT specify this clause, then those statistics will remain from their prior collection. This leads to statistics becoming stale, and that can lead to access path issues.
The simple way to determine if there are older (potentially stale) statistics is to query SYSIBM.SYSCOLDIST looking for distinctly different STATSTIMEs for the same table. One exception – since frequency statistics are stored as a percentage of the data, then these only become stale if the stored percentages are no longer representative. It is feasible to collect single column frequencies once, and not recollect them regularly – so be careful cleaning up single column frequency entries (TYPE=’F’ and NUMCOLUMNS=1).
DB2 11 RUNSTATS delivers a new RUNSTATS option – RESET ACCESSPATH – to reset statistics on the objects to -1. This is an effective way to remove old statistics, but ensure you follow up by recollecting statistics with your standard RUNSTATS options. Alternatively, frequency statistics can be removed by specifying COUNT 0 – for example, if you previously collected NUMCOLS 5 COUNT 10, you can remove those specific statistics by using NUMCOLS 5 COUNT 0. This approach does not apply to histogram or other colgroup statistics. Another alternative is to simply issue a DELETE statement to remove the unnecessary rows from SYSIBM.SYSCOLDIST.
Tooling such as IBM Optim Query Workload Tuner can help identify stale or conflicting statistics, and can also identify additional statistics that would benefit your workload. DB2 11 for z/OS also helps identify statistics that are missing based upon your SQL, and also statistics conflicts that exist. As mentioned in mistake #1 and #4, conflicts are often because of statistics being collected at different times – which points to an error in your procedures, or because statistcs were collected and never recollected or deleted. But you shouldn’t wait until they introduce a statistical conflict before you address these.
Did you miss Terry Purcell present the webcast yesterday Title:- "Real Time Analytics Query Performance with DB2 for z/OS" ? Don't worry it is available on replay. Listen to this top preforming webcast anytime, any place using any device. Just Click here to access the recording: http://ow.ly/VCVfY.
We had nearly 800 registrations so this is another record !.
Also just in case you missed the whitepaper Terry wrote you can download this now Improved_query_performance_in_IBM_DB2_11_for_zOS.pdf
Terry Purcell is a Senior Technical Staff Member with the IBM Silicon Valley Lab, where he is lead designer for the DB2 for z/OS Optimizer. Terry has two decades of experience with DB2 in database administration and application development as a customer, consultant, and DB2 developer.
Please, subscribe to get an access.