Group Cursor in esProc

In big data computing, besides grouping and aggregate operations, you sometimes need to retrieve one group of data at a time for analysis. For example, you may want to analyze sales by date, or collect statistics on the sales curve of each product or the purchasing habits of each client.

In esProc, you can use the function cs.fetch(;x) or cs.skip(;x) to get or skip records until the value of expression x changes. In this way, a group of consecutive records can be obtained. For example, retrieve one product at a time in order to examine the sales data of each product:

[Image: esProc_group_cursor_1]

From B7, the records of the 20th product can be retrieved like this:

[Image: esProc_group_cursor_2]

Data retrieval from an esProc cursor is a one-way street. The data in the cursor must therefore already be in order whenever you retrieve one group of records at a time.
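The idea of reading one group at a time from a one-way stream can be sketched outside esProc. The following Python snippet is only an analogy, not esProc code: the sample rows, the field names PID and Amount, and the helper group_cursor are all hypothetical. It relies on itertools.groupby, which, like the group fetch described above, only collects consecutive records with the same key, so the input has to be sorted by that key.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows, already sorted by product ID (PID).
records = [
    {"PID": "B1001", "Amount": 5},
    {"PID": "B1001", "Amount": 3},
    {"PID": "B1002", "Amount": 7},
    {"PID": "B1003", "Amount": 2},
    {"PID": "B1003", "Amount": 4},
]

def group_cursor(rows, key):
    """Yield (key_value, rows_in_group) for each run of consecutive equal keys.

    The input is consumed strictly forward, so groups are only complete
    if the rows are already ordered by the key.
    """
    for value, run in groupby(rows, key=key):
        yield value, list(run)

cursor = group_cursor(iter(records), itemgetter("PID"))

first_pid, first_group = next(cursor)   # roughly like fetching one group
next(cursor)                            # roughly like skipping one group
third_pid, third_group = next(cursor)   # fetch the group after the skipped one
```

If the rows were not sorted by PID, the same product would come back as several separate runs, which is why the ordering requirement matters.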

As we know, the @z option can be used to retrieve a file, or the data of a cursor, by block. However, when retrieving by block, esProc decides how the data is divided, and this can sometimes cause trouble.

First, let’s prepare the data. Take the data used above, which is already sorted by the sequence number, and store it in a new binary file, Order_Products:

[Image: esProc_group_cursor_3]

In later computation, retrieving the data by segment leads to the situation shown below:

[Image: esProc_group_cursor_4]

After all the data is divided into 100 segments, A3 retrieves the data of the 1st segment and A5 retrieves the data of the 2nd segment, as shown below:

[Image: esProc_group_cursor_5]

Here you run into a problem: the sales records of product number B1445 appear in both segments. If you aggregate after each retrieval, duplicate product numbers may appear in the results, and a second aggregation is needed to get the final result. Such piecewise computation is quite common in parallel computation over big data, and this situation makes the computation more complicated. To avoid it, the data should be segmented by group when it is stored.
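The problem can be reproduced with a small Python sketch. This is an illustration only: the product list is invented, and split_fixed is a hypothetical helper that cuts the rows into equal-sized segments without looking at the data. It does not reproduce how esProc actually places segment boundaries.

```python
from collections import Counter

# Hypothetical product numbers, sorted, one entry per sales record.
pids = ["B1444"] * 2 + ["B1445"] * 3 + ["B1446"] * 2

def split_fixed(rows, n_segments):
    """Cut rows into n_segments pieces of (nearly) equal size, ignoring group boundaries."""
    size = -(-len(rows) // n_segments)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

segments = split_fixed(pids, 2)

# Aggregate each segment on its own, as a parallel task would.
partials = [Counter(segment) for segment in segments]

# B1445 straddles the boundary, so it shows up in both partial results
# and a second aggregation pass is needed to merge them.
split_groups = set(partials[0]) & set(partials[1])
merged = sum(partials, Counter())
```

Here split_groups contains B1445, and only the merged counter holds its full record count.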

When storing binary data with a cursor, simply use the @g option. The data written to the file is then segmented by group, so that when the data is retrieved by block, all records of the same group are sure to be retrieved together. For example:

[Image: esProc_group_cursor_6]

The data, sorted by the sequence number of products, is saved as a binary file Order_Products_G and segmented by group according to PID. This differs slightly from the method used earlier to write the file Order_Products. Please note that segmenting by group on storage is only valid for binary files.

Now retrieving by segment behaves differently:

[Image: esProc_group_cursor_7]

In this step, the data retrieved in A3 and A5 is as follows:

[Image: esProc_group_cursor_8]

This time, segment 1 contains all the records of product number B1445, and segment 2 starts from the next product. As can be seen, if segmenting by group is specified when the binary file is written, the data of a whole group is placed in a single segment when retrieved through the cursor. Segmenting by group guarantees the integrity of the data in each group and makes piecewise computation over big data simpler.
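The effect of group-aligned segments can be sketched in Python as well. This is an assumption-laden illustration, not a description of how the @g option is implemented: split_by_group is a hypothetical helper that simply delays each segment boundary until the key changes, and the product list is invented.

```python
from collections import Counter

# Hypothetical product numbers, sorted, one entry per sales record.
pids = ["B1444"] * 2 + ["B1445"] * 3 + ["B1446"] * 2

def split_by_group(rows, n_segments, key=lambda row: row):
    """Cut rows into roughly equal segments, but only at group boundaries."""
    target = -(-len(rows) // n_segments)  # ceiling division
    segments, current = [], []
    for row in rows:
        # Close the current segment only when it is large enough
        # AND the incoming row starts a new group.
        if len(current) >= target and key(row) != key(current[-1]):
            segments.append(current)
            current = []
        current.append(row)
    if current:
        segments.append(current)
    return segments

segments = split_by_group(pids, 2)
partials = [Counter(segment) for segment in segments]

# No product appears in more than one segment, so the partial
# results can be concatenated without re-aggregation.
shared = set(partials[0]) & set(partials[1])
```

The trade-off in this sketch is that segments are no longer exactly equal in size: the first one grows to five records so that B1445 stays whole.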


About datathinker

A technical consultant on database performance optimization, database storage expansion, and off-database computation. Personal blog: datakeywrod; website: raqsoft