Dontcheff

Detecting Data Tampering and Measuring Asymmetry and Tailedness of Data in Oracle Database 20c

In DBA, New features, SQL on September 11, 2020 at 11:27

There are seven types of SQL functions in the Oracle database:

1. Single-Row Functions
2. Aggregate Functions
3. Analytic Functions
4. Object Reference Functions
5. Model Functions
6. OLAP Functions
7. Data Cartridge Functions

The single-row functions are the most popular and used ones but the aggregate and analytical function are also extremely popular among Oracle developers (and by DBAs too).

Three new analytical and statistical aggregate functions are now available in Oracle Database 20c. Let us use the RDBMS_BRANDS table for the 3 examples below. For the sake of clarity the last column shows if the database is only available from the cloud.

1. CHECKSUM computes the checksum of the input values or expression and can be applied on a column, a constant, a bind variable, or an expression involving them. All datatypes except ADT and JSON are supported.

In earlier releases you can still use DBMS_SQLHASH.GET_HASH in order to check the integrity of result sets or some other other PL/SQL packages: STANDARD_HASH or DBMS_CRYPTO.

Here is an example of how to use CHECKSUM:

SQL> select DB_ENGINES_RANK,DB_ENGINES_SCORE
from RDBMS_BRANDS
where DB_ENGINES_RANK < 50; 

DB_ENGINES_RANK DB_ENGINES_SCORE
--------------- ----------------
              1          1345.44
              2          1282.64
              3          1078.31
              4           514.81
              5           162.64
              8            90.09
             10            73.89
             13            50.54
             15            27.59
             18            20.27
             42             3.73
             48             2.36

12 rows selected.

SQL> select CHECKSUM(DB_ENGINES_RANK), CHECKSUM(DB_ENGINES_SCORE)
from RDBMS_BRANDS where DB_ENGINES_RANK < 50;  

CHECKSUM(DB_ENGINES_RANK) CHECKSUM(DB_ENGINES_SCORE)
------------------------- --------------------------
                   288250                     209742

Let us now update the rankings of Oracle which in September went up to 1369.36 points and observe how the checksum value for DB_ENGINES_SCORE changed from 209742 to 180002!


SQL> update RDBMS_BRANDS set DB_ENGINES_SCORE=1369.36 
where RDBMS_NAME='ORACLE';

1 row updated.

SQL> commit;

Commit complete.

SQL> select 
CHECKSUM(DB_ENGINES_RANK), CHECKSUM(DB_ENGINES_SCORE)
from RDBMS_BRANDS where DB_ENGINES_RANK < 50;  

CHECKSUM(DB_ENGINES_RANK) CHECKSUM(DB_ENGINES_SCORE)
------------------------- --------------------------
                   288250                     180002


Note: NULL values in CHECKSUM column are ignored. Also, if you rollback the transaction, then the checksum value does not change.

There are 2 other analytical functions which are new to Oracle 20c. Skewness and Kurtosis describes the shape of a probability distribution and there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population. Oracle is now offering a very easy way of calculating their values by providing the in-built analytical functions in Oracle 20c.

2. SKEWNESS functions SKEWNESS_POP and SKEWNESS_SAMP are measures of asymmetry in data. A positive skewness is means the data skews to the right of the center point. A negative skewness means the data skews to the left.

Here is an example of how to use SKEWNESS:

SQL> select CLOUD_ONLY, count(*) from RDBMS_BRANDS group by CLOUD_ONLY;

C   COUNT(*)
- ----------
N          9
Y          3

SQL> select CLOUD_ONLY,
SKEWNESS_POP(DB_ENGINES_RANK), SKEWNESS_POP(DB_ENGINES_SCORE)
from RDBMS_BRANDS
group by CLOUD_ONLY; 

C SKEWNESS_POP(DB_ENGINES_RANK) SKEWNESS_POP(DB_ENGINES_SCORE)
- ----------------------------- ------------------------------
N                     2.0503487                     .584150452
Y                    .685667537                      -.4626607
   

Skewness makes sense in the situation where DB_ENGINES_RANK and DB_ENGINES_SCORE represent the database brand rank and score in the list and you want to determine whether the outliers in data are biased towards the top end or the bottom end of the distribution, that is, if there are more values to the top of the mean when compared to the number of values to the bottom of the mean.

3. KURTOSIS functions KURTOSIS_POP and KURTOSIS_SAMP measure the tailedness of a data set where a higher value means more of the variance within the data set is the result of infrequent extreme deviations as opposed to frequent modestly sized deviations.

Here is an example of how to use KURTOSIS:

SQL> select CLOUD_ONLY, count(*) from RDBMS_BRANDS group by CLOUD_ONLY;

C   COUNT(*)
- ----------
N          9
Y          3

SQL> select CLOUD_ONLY,
KURTOSIS_POP(DB_ENGINES_RANK), KURTOSIS_POP(DB_ENGINES_SCORE)
from RDBMS_BRANDS
group by CLOUD_ONLY;  

C KURTOSIS_POP(DB_ENGINES_RANK) KURTOSIS_POP(DB_ENGINES_SCORE)
- ----------------------------- ------------------------------
N                    2.88880521                     -1.4382391
Y                             0                              0
  

Note that a normal distribution has a kurtosis of zero. Have a look at the KURTOSIS_POP for the cloud-only databases Google BigQuery, Amazon Redshift and Snowflake.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: