Histograms for strongly skewed columns

On a recent OTN thread, I learned a nice trick by J. Lewis that allows to circumvent certain problems with histograms.

Histograms were designed to solve the problem of estimating cardinality for skewed columns (i.e. where some values occur much more frequently than the others). For columns with low number of distinct values (NDV) Oracle collects a frequency histogram, which can be thought of as a set of two one-dimensional arrays: one containing all possible values, the other containing their frequency (i.e. how many rows have this value). However, if sample size is small, then Oracle can miss rare values, and they won’t be reflected in the histogram. As a result, the cardinality estimates for those values will be wrong (depending on version Oracle will either set it to either 1 or to half of the frequency for the rarest value found). A detailed explanation of the issues with examples can be found in blog posts by J. Lewis and R. Geist.

The trick is to create a function-based index that hides popular values, thus making sure that rare values are accurately represented in a histogram. Let’s consider an example: a table T with just one column X, which has two possible values, 1 (1M occurences) and 2 (10 occurences).


create table t as
with gen as (select 1 from dual connect by level<=1e3)
select 1 x from gen g1, gen g2;

insert into t select 2 from dual connect by level<=10;

commit;

create index i$t$x on t(x);

exec dbms_stats.gather_table_stats(null, 'T', method_opt=>'for all columns size 254');</pre>
explain plan for select * from t where x=2;

select * from table(dbms_xplan.display());

Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   500K|  1464K|   448   (3)| 00:00:06 |
|*  1 |  TABLE ACCESS FULL| T    |   500K|  1464K|   448   (3)| 00:00:06 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("X"=2)

As we can see, we get a wrong cardinality with cardinality off by 5 orders of magnitude.

Now let’s create a function-based index that would “hide” the popular value:

create index i$t$x_rare on t(case(x) when 2 then x end);

exec dbms_stats.gather_table_stats('SCOTT', 'T', method_opt=>'for all columns size 254');

and modify the query to use statistics on the virtual column created by the FBI

explain plan for select * from t where case(x) when 2 then x end =2;

select * from table(dbms_xplan.display());

Plan hash value: 1371096493

------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    10 |    40 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T          |    10 |    40 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | I$T$X_RARE |    10 |       |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(CASE "X" WHEN 2 THEN "X" END =2)

Now we are getting a better plan and correct cardinalities.

Of course it is a big disadvantage that we need to rewrite the query to make use of the virtual column, so in some cases it may be a better solution to use another trick suggested by J. Lewis: use “hand-made” histograms that provide a more realistic distribution. Of course, it is also always possible to increase the sample size, making sure that rare values aren’t missed, if the data size is not so big that larger sample size would make calculating histograms prohibitively expensive.

It should be also possible to the virtual column approach for dealing with correlated predicates.

Histograms for strongly skewed columns

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112