esProc’s Prime Keys and Their Index Function

In an esProc table sequence, a field or certain fields can be designated as the prime key(s). In this case, some special functions can be employed to make query based on the prime key(s). This can both simplify the code and increase computational performance effectively.

1. find and pfind

Prime keys are common in database tables. The field value of a prime key is solely used to mark a record, so identical values of prime keys are forbidden, which is a mandatory rule in many databases.

UnderesProc, it is assumed that each value of the prime keys is unique forthe records. But no mandatory check will be executed and there won’t be error reporting even if there are identical values. Both T.pfind(v) function and T.find(v) function can be used to query records in table sequence T according to v, value(s) of the prime key(s). find returns the first found record and pfind returns the sequence number of this record.

It is not necessary to set a primary key, or primary keys. When no primary key is set in esProc, the first field will be regarded as the primary key. If we want to set specified fields as primary keys, we use T.primary(Fi,…) function, meaning setting Fi,… as the primary keys of table sequence T.

Usually, we use conventional position function T.select() and T.pselect() to query records in a table sequence. Now we’ll compare usages of this pair of function with pfind and find.

Here EMPLOYEE table in demo database is the table sequence in which query will be executed. Add field FullName to it and use the employees’ full names to make query:

esProc_primekey_index_1

In order to manifest the performance advantages of using prime keys to query data, 10,000 full names will be generated arbitrarily and use pselect and pfind to make query according to these full names. Compute the time consumed by these two methods:

 

esProc_primekey_index_2

Based on the same data, A5 and A9 use pselect and pfind respectively to query positions of the records in the table sequence. In A9, primary function is used to set primary keys for the table sequence before pfind starts to work. A6 and A10 work out respectively the milliseconds consumed as follows:

esProc_primekey_index_3

Though the query results in A5 and A9 are same:

esProc_primekey_index_4

Using similar method, we may compare select function and find function. To keep in line with find function, @1 option is used in select function to get the first result and return it:

esProc_primekey_index_5

A6 and A10 compute respectively the milliseconds consumed as follows:

esProc_primekey_index_6

The query results in A5 and A9 are still same:

esProc_primekey_index_7

It can be seen easily from the comparison that the query functions based on primary keys are muchmore efficient than conventional position functions.

2. Primary keys and the index table

Why will the efficiency increase greatly when primary keys are employed to make query in esProc? The reason is that the index table of the primary keys is used in computing.

In calling a primary function, or making query based on primary keys in a table sequence without an index table for the first time, an index table will be generated according to the primary keys. While an index table is generated, a hash table will be generated according to all values of the primary keys which will be divided into many groups by hash values. These hash values are the corresponding group numbers.

Normally, when querying a certain record in a table sequence according to field value, it needs to examine the records one by one until the target is found. For a table sequence containing n records, an average of n/2 examinations are needed.

Thanks to the index table, it’s turning around in the case of querying a certain record in a table sequence according to field value of the primary key. First, we can compute the hash value according to the primary key value. This enables us to find out directly the corresponding group in the index table. Then we just need to examine records of the same group. In the same way, for a table sequence containing n records, if its primary keys are distributed in k groups according to hash values, we only have to make an average of n/2k comparisons. Using this method, despite the price paid due to computing hash values before generating an index table and executing the query, the number of comparisons is reduced significantly, and especially, it is enough to generate an index table once. Therefore, the more the data in a table sequence and the times needed for query, the higher the efficiency.

During computing, T.index(n) function can be used to create an index table for Ts primary keys in advance. n represents the length of the index table. Default length will be used if there is no particular setting.

What we should know is that the reason that the functions that execute query using values of primary keys, such as find and pfind, can enhance computational performance effectively is because an index table has been created for primary keys in a table sequence. For this reason, if the primary keys themselves can be used as indexes to locate records, it is unnecessary to create an index table. Field EID itself in the above-mentioned EMPLOYEE table, for example, represents the positions of records in the table sequence, thus it is more efficient to use itself to query the corresponding records:

esProc_primekey_index_8

This time 10,000 sequence numbers of employees are generated arbitrarily. A4 queries the corresponding records directly according to these sequence numbers, yet A7 still uses find function to query. A5 and A8 compute respectively the time consumed as follows:

esProc_primekey_index_9

We can see that it is faster to locate records directly using sequence numbers. Because this method doesn’t have to compare field values, nor does it compute the hash values and create an index table. The query results in A4 and A7 are same:

esProc_primekey_index_10

Thus it can be seen that suitability should be taken into consideration in esProc if we are to useindex function of primary keys in a table sequence to increase efficiency.

3. switch function

The role of switch function is to relate the value of a certain field F in table sequence T1 to the primary keys or specified fields of another table sequence T2, thus foreign keys can be created in T1 to reference corresponding records in T2. The operation of creating foreign references is similar to that of using find to query records according to values of primary keys in T2.This process also involves creating an index table. However,T1.run(F=T2.select@1(…))  can also be used to perform this operation. We’ll first compare the efficiency of the two methods:
esProc_primekey_index_11
Both A2 and A3 contain personnel information loaded from binary text file PersonnelInfo:

esProc_primekey_index_12

A4 contains states information:

esProc_primekey_index_13

In both A6 and A9, field State of PersonnelInfo is switched into corresponding states information. Their difference is that A6 uses select@1 function while A9 uses switch function. A7 and A9 compute respectively the time consumed as follows:

esProc_primekey_index_14

After the code in the cellset is executed, values of A2 and A3 are same:

esProc_primekey_index_15

Field State has been switched into corresponding records in states information table.

Before switch is executed, an index table will also be created for corresponding fields in the table sequence. In the example, an index table is created for field ABBR of states information table in A4 to increase matching efficiency. Note that here field ABBR equals a primary key set temporarily. The requirement for the field of a primary key that no identical values are allowed applies to its data too.

So the computational efficiency of switch function is obviously higher than that of cases using run+select@1. Meanwhile, since the relational computation of single foreign key field is relatively common, esProcspecially offers the switch function to simplify the code produced by using functions as well as to increase the computational efficiency effectively.

Advertisements

About datathinker

a technical consultant on Database performance optimization, Database storage expansion, Off-database computation. personal blog at: datakeywrod, website: raqsoft
This entry was posted in Unique and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s