Understanding Oracle statistics

Cyrille Modiano

OCP Oracle DBA in Montreal at Desjardins


Standard Oracle statistics

Oracle statistics are metadata about Oracle objects in the database; they are stored in the data dictionary.

Standard Oracle statistics contain information about tables:

  • Number of rows
  • Average row length
  • Number of data blocks
  • Number of distinct values


This information is used by the optimizer while building the query plan to estimate the cost of a join clause, an equality clause, a full table scan, or using an index on a column. The number of data blocks and the average row length are used to accurately estimate the cost of I/Os.

The column statistics include information about the number of distinct values (NDV) as well as the maximum and minimum values. This information helps the optimizer determine whether an equality clause in a query will be selective or not.

For example, if you have a table with 1000 rows and 10 distinct values, Oracle assumes the cardinality is 1000 / 10 (number of rows divided by the number of distinct values) = 100.
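As a sketch, you can reproduce this estimate from the dictionary views that store these statistics (the table and column names here are hypothetical):

```sql
-- NUM_ROWS / NUM_DISTINCT is the optimizer's default cardinality guess
-- for an equality predicate when no histogram is present.
SELECT t.num_rows,
       c.num_distinct,
       ROUND(t.num_rows / c.num_distinct) AS estimated_cardinality
FROM   user_tables             t
JOIN   user_tab_col_statistics c ON c.table_name = t.table_name
WHERE  t.table_name  = 'MY_TABLE'
AND    c.column_name = 'MY_COLUMN';
```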

But here is the problem: with standard statistics, Oracle always assumes that data is evenly distributed in the table, and sometimes that's wrong.
This assumption can lead the optimizer to make a wrong decision and, as a result, a bad execution plan.

This is why Oracle added the ability to create histograms on columns, to store information about the data distribution within a column.

Oracle 10g/11g Histograms

There are two types of histograms: frequency histograms and height balanced histograms.
Oracle decides which type of histogram to use depending on the number of distinct values (NDV) in the column.
If the NDV is 254 or less, a frequency histogram is used; otherwise, a height balanced histogram is used.
This limit is a compromise between accuracy on one side and storage space and histogram build time on the other.

Frequency histogram

In a frequency histogram, Oracle stores the number of rows represented by each distinct value in the column.
This histogram helps precisely estimate the cost of a join or equality clause for a specific column value.

(Figure: frequency histogram — number of rows for each distinct value)

With this diagram, it is easy to understand why it costs more to join with the value represented by series 5 than with the value represented by series 4.

If the number of distinct values is greater than 254, Oracle decides it is too expensive, in terms of both time and storage, to keep storing the number of rows for each distinct value, so it builds a different type of histogram: the height balanced histogram.

Height Balanced Histogram

In a height balanced histogram, Oracle divides the total number of rows of the table into buckets of the same size. If your table has 100,000 rows, each of the 254 buckets (the maximum number of buckets) will hold 100,000 / 254 ≈ 394 rows.

For each bucket, Oracle stores the endpoint value, which is the last column value in the bucket.

Once the division is done, Oracle looks for buckets with duplicate endpoints and groups them into the same bucket. This step later helps estimate the number of rows for each popular value.
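You can inspect the bucket endpoints through the USER_HISTOGRAMS view (object names here are hypothetical); after the grouping step, popular values show up as jumps in ENDPOINT_NUMBER:

```sql
-- ENDPOINT_NUMBER is cumulative: a jump of more than 1 between two
-- consecutive rows means several buckets shared the same endpoint,
-- i.e. that endpoint value is popular.
SELECT endpoint_number, endpoint_value
FROM   user_histograms
WHERE  table_name  = 'MY_TABLE'
AND    column_name = 'MY_COLUMN'
ORDER  BY endpoint_number;
```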

(Figure: height balanced histogram — buckets of equal size with their endpoint values)

The estimated cardinality is calculated as follows:

– If the value in the equality clause is an endpoint value, Oracle uses this formula: (number of buckets with that endpoint / total number of buckets) * number of rows in the table.

– If the value is not an endpoint, Oracle uses the density (a decimal number between 0 and 1; a value close to 1 indicates the column is unselective, a value close to 0 indicates it is highly selective) to estimate the cardinality: density * number of rows in the table.

The density is stored with the column statistics in the data dictionary; the optimizer can also refine its estimates at parse time through dynamic sampling (controlled by optimizer_dynamic_sampling).
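The density and histogram metadata the optimizer works with are visible in the dictionary (the table name here is hypothetical):

```sql
-- HISTOGRAM shows NONE, FREQUENCY or HEIGHT BALANCED
-- (plus TOP-FREQUENCY and HYBRID from 12c onwards).
SELECT column_name, num_distinct, density, num_buckets, histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'MY_TABLE';
```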

Oracle 12c Histograms

Oracle 12c introduced new types of histograms, the TOP-N frequency histogram and the hybrid histogram. The height balanced histogram is kept only as a legacy type and is no longer created when statistics are gathered with the default AUTO_SAMPLE_SIZE.

TOP-N Frequency Histogram

TOP-N frequency histograms are an improvement of the previous frequency histograms. The TOP-N frequency histogram is based on value popularity: to determine whether a value is popular, Oracle checks whether the endpoint value is shared by multiple buckets. If it is not, the value is considered unpopular. Oracle removes the unpopular values from the histogram to produce a better histogram for the popular values. TOP-N frequency histograms are only collected under these conditions:

– The ESTIMATE_PERCENT parameter of the DBMS_STATS package is set to AUTO_SAMPLE_SIZE.
– The number of distinct values is greater than the specified number of buckets (n).
– The top n popular values account for a percentage of rows greater than

(1 - (1/n)) * 100

For the default of n = 254, that threshold is about 99.6% of the rows.

Hybrid Histograms

A hybrid histogram is a combination of the height balanced histogram and the frequency histogram. This combination enables the optimizer to obtain a better estimate of the selectivity of some queries.

The improvement concerns "almost popular" values, that is, values spanning more than one bucket but less than two.

For example, suppose your bucket size is 100 and the first bucket is filled entirely with the value 501. Oracle stores the remaining rows for 501 in a second bucket, but they fill only 95% of it, so the endpoint value of the second bucket will be 502, not 501. When buckets with the same endpoint are later grouped, Oracle will have only one bucket for the value 501 and will underestimate the cardinality for that value.

To solve this issue, Oracle decided to group all identical values into one bucket and introduced a new statistic about the endpoint value, the "endpoint repeat count", which represents the number of rows in the bucket that carry the endpoint value.
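On 12c, this statistic is exposed in the ENDPOINT_REPEAT_COUNT column of the histogram views (object names here are hypothetical):

```sql
-- For a hybrid histogram, ENDPOINT_REPEAT_COUNT tells how many rows
-- in the bucket hold the endpoint value itself.
SELECT endpoint_number, endpoint_value, endpoint_repeat_count
FROM   user_histograms
WHERE  table_name  = 'MY_TABLE'
AND    column_name = 'MY_COLUMN'
ORDER  BY endpoint_number;
```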

(Figure: hybrid histogram — buckets with their endpoint repeat counts)

With the repeat counts, the optimizer can determine exactly how many occurrences of the endpoint value exist.

A hybrid histogram is created when:

– The number of buckets specified (default is 254) is less than the number of distinct values.
– A frequency histogram can't be used because its conditions do not apply.
– AUTO_SAMPLE_SIZE is specified as ESTIMATE_PERCENT value.

These examples demonstrate how essential the Oracle statistics calculation is for good performance.

How to use the dbms_stats package?

Since Oracle 10g, statistics are automatically collected by a maintenance job during a maintenance window. Oracle statistics are not collected on every table each time the job runs; an algorithm determines whether enough data has changed in a table to invalidate its statistics.

The default job uses the procedure DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC; it gathers statistics only on tables which have been modified significantly (at least 10% of the rows).

Oracle determines which tables are most in need of new stats, then starts collecting statistics for those tables first, to be sure the most important stats are gathered during the maintenance window.

This procedure is very similar to the DBMS_STATS.GATHER_DATABASE_STATS with default parameters.

The main dbms_stats parameters

  • ESTIMATE_PERCENT: the percentage of rows Oracle samples to create the statistics. The higher the percentage, the more accurate the optimizer's plans will be. You can also use DBMS_STATS.AUTO_SAMPLE_SIZE to let Oracle determine the appropriate sample size (this is the default value).
  • DEGREE: the degree of parallelism Oracle uses to collect statistics. You can specify a fixed value, use DBMS_STATS.DEFAULT_DEGREE, which is based on the initialization parameter CPU_COUNT, or DBMS_STATS.AUTO_DEGREE, which decides whether to use parallelism (DBMS_STATS.DEFAULT_DEGREE) or not depending on the size of the object.
  • CASCADE: TRUE or FALSE; defines whether the stats should be gathered on indexes as well.
  • OPTIONS:
    • GATHER_AUTO: Oracle automatically determines which objects need new statistics and gathers statistics for those objects (this is what the default job uses since 10g).
    • GATHER_STALE: Oracle only gathers stats on objects with stale statistics.
  • METHOD_OPT: probably the most important parameter; it defines which columns the stats will be collected on and whether Oracle will collect histograms on these columns.

Its value can be:

– FOR ALL [INDEXED | HIDDEN] COLUMNS [size_clause]
– FOR COLUMNS [size clause] column|attribute [size_clause] [,column|attribute [size_clause]…]
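Putting the main parameters together, a typical call looks like this (the table name is hypothetical; the values shown match the defaults described above):

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => USER,
    tabname          => 'MY_TABLE',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,   -- default sample size
    degree           => DBMS_STATS.AUTO_DEGREE,        -- let Oracle pick parallelism
    cascade          => TRUE,                          -- gather index stats too
    method_opt       => 'FOR ALL COLUMNS SIZE AUTO');  -- histograms where useful
END;
/
```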

The best way to know whether statistics are causing a wrong plan is to look at the number of rows estimated by the optimizer and compare it with the actual value.

Let’s take an example and create a table with skewed data:
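A minimal sketch of such a table (the names SKEWED_TAB and NUMBER_ONE are assumptions, chosen to match the column referenced later in the article):

```sql
-- Hypothetical table: one heavily skewed column, NUMBER_ONE.
CREATE TABLE skewed_tab (
  id         NUMBER,
  number_one NUMBER
);
```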

Now, we insert skewed data :
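One possible way to load the skew into the hypothetical SKEWED_TAB table (the 72,000 / 617 split is an assumption that adds up to the 72,617 rows quoted below):

```sql
-- Value 1 dominates; value 2 is rare.
INSERT INTO skewed_tab (id, number_one)
SELECT level, 1 FROM dual CONNECT BY level <= 72000;

INSERT INTO skewed_tab (id, number_one)
SELECT level + 72000, 2 FROM dual CONNECT BY level <= 617;

COMMIT;
```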

We now have a 72,617-row table with skewed data:
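Which we can verify with a simple aggregate (against the hypothetical SKEWED_TAB table):

```sql
SELECT number_one, COUNT(*)
FROM   skewed_tab
GROUP  BY number_one;
```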

Let's gather the stats without a histogram on this table:
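For example (table name hypothetical; SIZE 1 asks for a single bucket, i.e. no histogram):

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'SKEWED_TAB',
    method_opt => 'FOR ALL COLUMNS SIZE 1');  -- SIZE 1 = no histogram
END;
/
```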

Now, we can check if Oracle created a histogram for the column:
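A sketch of the check, using the hypothetical table name from above:

```sql
-- With SIZE 1, the HISTOGRAM column shows NONE.
SELECT column_name, num_distinct, histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'SKEWED_TAB';
```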

In this case the optimizer will assume the data is evenly distributed and we know that’s wrong. Let’s make a test.
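One way to run the test, assuming the hypothetical SKEWED_TAB table with its rare value 2:

```sql
EXPLAIN PLAN FOR
SELECT * FROM skewed_tab WHERE number_one = 2;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- Without a histogram, expect Rows = 36309 in the plan:
-- 72617 / 2, ignoring the skew.
```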

Without the histogram, Oracle clearly chooses the wrong plan: its estimate is 36,309 rows, which represents:

Total number of rows in the table / number of distinct values: 72617 / 2 ≈ 36309

Now, what happens when we collect stats with a histogram on this column:
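For example, asking explicitly for a histogram on the skewed column (object names hypothetical):

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'SKEWED_TAB',
    method_opt => 'FOR COLUMNS NUMBER_ONE SIZE 254');  -- allow a histogram
END;
/
```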

Now we see that a frequency histogram with 2 buckets was created, which means Oracle collected the number of rows for each distinct value. Let's try again:
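Re-running the same test against the hypothetical table:

```sql
EXPLAIN PLAN FOR
SELECT * FROM skewed_tab WHERE number_one = 2;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- With the frequency histogram, the estimate drops to ~617 rows,
-- and an index on NUMBER_ONE (if one exists) becomes attractive.
```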

How does Oracle decide to collect histogram on a column?

By using the information extracted from sys.col_usage$. This table records how predicates are used against columns in SELECT statements. Here is an example of what you can find in it:
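A sketch of how to decode it (requires access to the SYS-owned tables; the filtered table name is hypothetical):

```sql
-- col_usage$ keys on object and column numbers; join back to obj$ and
-- col$ to get readable names.
SELECT o.name AS table_name,
       c.name AS column_name,
       u.equality_preds,
       u.range_preds,
       u.like_preds
FROM   sys.col_usage$ u
JOIN   sys.obj$       o ON o.obj#    = u.obj#
JOIN   sys.col$       c ON c.obj#    = u.obj#
                       AND c.intcol# = u.intcol#
WHERE  o.name = 'SKEWED_TAB';  -- hypothetical table name
```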

Here we see that Oracle keeps information about column usage. The NUMBER_ONE column has been used 6 times in an equality clause.

This means that Oracle can’t know when to create a histogram unless there has been activity on the table.

Advice

For most databases the default job should be fine and you should not modify it unless you encounter performance problems directly related to statistics.

On Oracle 10g

The main problem you can encounter in 10g is with tables containing highly skewed data: sometimes DBMS_STATS.AUTO_SAMPLE_SIZE is too low and does not collect enough data to get a good picture of the data distribution, which can result in bad cost estimates by the optimizer.
If you notice performance problems on some SQL statements, look at the plan and check whether the optimizer is badly estimating the number of rows involved.
If so, and if your maintenance window allows it, you should collect stats with an ESTIMATE_PERCENT of 100%.

On Oracle 11g

The problem with DBMS_STATS.AUTO_SAMPLE_SIZE on 10g has been solved in 11g.
Oracle introduced a new hash-based algorithm which achieves close to 100%-sample accuracy in a minimum of time.
But be careful: if you explicitly set ESTIMATE_PERCENT=>100, the old algorithm is used and it will take much longer. The default for ESTIMATE_PERCENT is AUTO_SAMPLE_SIZE, and it should always be used.

Choosing the best moment

My advice is to leave the default job in place, as it does a good job, but be sure to adapt the maintenance window to your situation; the time at which you choose to collect the stats is decisive for their accuracy and future usefulness.
For example, suppose you choose to collect stats at 22:00 and:

– a job runs at 21:30 and deletes data from some tables
– another batch runs at 02:00 and inserts data into those tables.

You can easily see that 22:00 is not the ideal time to collect Oracle statistics, as you don't get a good picture of the real state of your tables.

I hope this article was clear and that you learned a bit more about Oracle statistics.

Don't forget to consult the Oracle documentation on gathering system statistics:

Guidelines for Gathering System Statistics

and Oracle statistics on fixed objects:

Gathering Statistics for Fixed Objects

Thank you for reading.
