Selectivity estimation

User's Guide
PART 4. Database Administration and Advanced Use
CHAPTER 25. Query Optimization

Selectivity estimation

Selectivity is a ratio that measures how frequently a predicate is true.

The selectivity of a predicate measures how often the predicate evaluates to TRUE. Selectivity is defined as the ratio of the number of times the predicate will evaluate to true, to the total number of possible instances that must be tested. Selectivity is most commonly expressed as a percentage. For example, if 2% of employees have the last name Smith, then the selectivity of the following predicate is 2%.

emp_lname = 'Smith'

Selectivity is second only to join enumeration in importance to the process of optimization. Hence, the performance of the optimizer relies heavily on the presence of accurate selectivity information.

Adaptive Server Anywhere can obtain estimates of selectivity from four possible sources. It assumes no correlation between columns of a table and so calculates the selectivity of each column independently.

Column-statistics registry Each time Anywhere performs a query, it saves selectivity information about the data in a column for future reference.
Partial index scans The optimizer can examine the upper levels of an index to obtain a selectivity estimate for a condition on an indexed column.
User-supplied values You can supply selectivity estimates in your SQL statement. If you do so, Anywhere will use them in preference to those from other sources.
Default values If no other source is available, Anywhere can fall back on the built-in default values.

The scan factor is the fraction of pages in a table that need to be read.

The scan factor queries the fraction of pages in a table that needs to be read to compute the result. It is also usually expressed as a percentage. For example, to find the first name of the employee with employee number 123, Anywhere may have to read two index pages and, finally, the name contained in the appropriate row. If there are 1000 pages in the employee table, then the scan factor for this query would be 0.3%, meaning 3 pages out of 1000.

Although the scan factor is frequently small when the selectivity is small, this is not always the case. Consider a request to find all employees who live on Phillip Street. Less than one percent of employees may live on this street, yet, because street names are not indexed, Anywhere can only find the records by examining every row in the employee table.

Column-statistics registry

Adaptive Server Anywhere caches skewed predicate selectivity values and column distribution statistics. It stores this information in the database. Anywhere stores, logs, and checkpoints this information like other data. Adaptive Server Anywhere updates these statistics automatically during query processing.

The optimizer automatically retrieves and uses these cached statistics when processing subsequent queries. This selectivity information is available to all transactions, regardless of the user or connection.

Adaptive Server Anywhere manages the column-statistics registry on a first-in, first-out basis. It is limited in size to 15,000 entries. Anywhere saves the following types of information.

column distribution statistics
LIKE predicate selectivity statistics
equality predicate statistics

Do not give Anywhere amnesia!

You can reset the optimizer statistics using the DROP OPTIMIZER STATISTICS command. If you do so, you erase all the statistics that Anywhere has accumulated.

Caution
Use the DROP OPTIMIZER STATISTICS command only when you have made recent wholesale changes that render previous statistical information invalid. Otherwise, you should avoid this command because it can cause the optimizer to choose very inefficient access plans.

If you erase the statistics, Anywhere must resort to initial guesses about the distribution of your data as though accessing it for the first time. All performance improvements that the statistics could provide will be lost.

Subsequent queries will gradually restore the statistics. In the interim, the performance of many commands can suffer seriously. Consequently, this command rarely improves performance and certainly never provides a long-term solution.

Partial index scans

When cached results are not available to the optimizer, it can decide to probe the directory of an index to estimate the proportion of entries that may satisfy a given predicate. Depending on the predicate and the index, this information may be very accurate.

For example, the optimizer might examine an index of dates to estimate what proportion refer to days before a given date, such as March 3, 1998. To obtain such an estimate, Anywhere examines the upper pages of the index that you have created on that column. It locates the approximate position of the given date, then, from its relative position in the index, estimates the proportion of values that occur before it.

Some cost may be involved in performing such scans because some index pages may need to be retrieved from disk, should they not already be available in the buffer cache. In addition, indices for very large tables, or primary indices for tables pointed to by a large number of foreign keys, may be extremely large. Low fan-out may mean that the optimizer could only obtain specific estimates by examining many pages. To limit this expense, the optimizer examines at most two levels of the index.

Naturally, this method is effective only when the column about which selectivity information is sought is the first column of the index. Should the column comprise the second or more column of the index, the index is of no help because the values will be distributed though out the index.

Similarly, estimates of LIKE selectivity values may be obtained by this method only when the first few letters of the pattern are available. In cases where only the middle or final sections of a word pattern appear, the optimizer must rely on one of the other three sources of selectivity information.

User estimates

Adaptive Server Anywhere allows you, as the user, to supply selectivity estimates of any predicate. These estimates are expressed as a percentage and must be supplied as a floating-point value. You may explicitly state such an estimate for any predicate you choose.

In cases where a user-supplied estimate is available, the optimizer will always use it in preference to an estimate available from any other source. In this situation, it even ignores cached selectivity values for that predicate.

Because the optimizer always uses any explicit estimates you provide, you can use these estimates to guide the optimizer in its choice of access plan.

You should use explicit estimates with care. Estimates in triggers or stored procedures are easily forgotten. Anywhere has no means to update them. For these reasons, all responsibility for their maintenance rests with the author of the procedure or administrator of the database. Should the distribution of data change over time, the values may prove inappropriate and lead the optimizer to choose access plans that are no longer optimal.

Default selectivity estimates

When all else fails and it can obtain estimates from none of the other three sources, the optimizer must fall back on default selectivity estimates.

Anywhere assumes that statistics in the column statistics registry are both present and accurate. For example, if the optimizer is considering a LIKE predicate, it looks in the column-statistics registry. If the registry contains no entry for that predicate, it assumes that none is stored because the selectivity is less than a small threshold value. Since default selectivity estimates are not specific to your data, they can mislead the optimizer into selecting a poor access plan.

When Anywhere executes that plan, it will use the results to save better selectivity estimates in the column statistics registry. If you execute the same query later, it will find these more accurate estimates and adjust the access plan if appropriate. For this reason, performance may be poor the first time or two you execute a particular query on a new database, or after dropping the optimizer statistics.

Anywhere uses the following default selectivities.

Predicate

Default selectivity

Column comparisons: equality to a constant, IS NULL, or LIKE

0.035%
(if not stored in registry)

Column comparisons: inequality to a constant

25%

Other LIKE or EXISTS

50%

Other equalities

5%

Other inequalities

25%

Other IS NULL or BETWEEN

6%

Predicate	Default selectivity
Column comparisons: equality to a constant, IS NULL, or LIKE	0.035% (if not stored in registry)
Column comparisons: inequality to a constant	25%
Other LIKE or EXISTS	50%
Other equalities	5%
Other inequalities	25%
Other IS NULL or BETWEEN	6%

Equijoin selectivity estimation

What is an equijoin?

Frequently, you will need to join two or more tables to obtain the results you need. Equijoins join two tables through equality conditions on one or more columns, as in the case of the following query.

SELECT *
FROM tablea AS a JOIN tableb AS b
   ON a.x = b.y

Join selectivity for equijoins

In the case of equijoins, Anywhere calculates the selectivity of the join based on the cardinality of the individual tables according to the following formula.

selectivity = cardinality(a JOIN b))/(cardinality(a) cardinality(b))

If the join condition involves two columns, then the optimizer uses data distribution estimates from the column statistics registry to estimate the cardinality of the result, and hence the selectivity of the join. Otherwise, if the join condition involves a mathematical expression, the join predicate selectivity estimate defaults to 5%.

Key joins—a rare case where syntax matters

The optimizer takes advantage of joins that are based on foreign key relationships. You can identify these to Adaptive Server Anywhere using the KEY JOIN syntax. When you use this syntax, the optimizer can estimate selectivity accurately using special information contained in the primary index. Anywhere only takes full advantage of these relationships when you explicitly use the KEY JOIN syntax. As such, it is a rare exception to the general rule that Anywhere optimizes your commands based on their semantics, not their syntax. When estimating the selectivity of key joins, the Anywhere optimizer assumes a uniform distribution of the values in the table that contains the foreign key.

Diagnosing and solving selectivity problems

Selectivity estimation problems are the root of most optimization problems. The following sources of information are available to help you.

Displaying estimates and their source

Anywhere can tell you the value of a selectivity estimate and the source of that estimate. You have access to this information through the built-in functions ESTIMATE and ESTIMATE-SOURCE.

The following command displays the selectivity estimate that the optimizer will use for queries which select entries from table T which in which column x contains a value greater than 20.

SELECT ESTIMATE (quantity, 20, '>')
FROM product

Anywhere displays the result of this command as a percentage.

Similarly, the following command displays the source of that estimate. The optimizer can contain estimates from a number of sources, including column statistics registry and user supplied values.

SELECT ESTIMATE_SOURCE (quantity, 20, '>')
FROM product

Solving selectivity problems

If you find that Anywhere is obtaining an incorrect selectivity value from the registry, you can easily reset the value by issuing any command that will perform a complete scan of the table for that condition. For example, the following SQL statement will cause Anywhere to locate all the entries in product table which have quantities equal to 20.

SELECT *
FROM product
WHERE quantity = 20

PLAN> product (seq)

Whenever Anywhere completes execution of a statement such as this one, it automatically updates the column statistics registry based on the results.

You can use a similar tactic to load initial selectivity information into a new database. Simply issue commands that contain conditions that will appear in common statements. When the optimizer later prepares to execute a statement, it will generate a better plan because the correct statistics will be available.

Maintain or remove hard-coded selectivity estimates

Users can supply and hard-code selectivity estimates in triggers or stored procedures while developing the database. Once encoded, the database administrator assumes responsibility for their maintenance as Anywhere has no means to update them automatically.

Unfortunately, these hard-coded values are hidden and may become inaccurate as the information in the database grows and changes. For this reason, you should avoid using them, except in very unusual cases where you can encourage the optimizer to choose a better access plan by no other means. Often, you can avoid using them by priming the database using sample queries as described above.