In big data computing, besides grouping and aggregation, you sometimes need to retrieve one group of data at a time for analysis. For example, you might analyze sales by date, compile the sales curve for each product, or study the purchase habits of each client.
In esProc, you can use the function cs.fetch(;x) or cs.skip(;x) to fetch or skip records until the value of expression x changes. In this way, one group of consecutive records that share the same value of x is obtained at a time. For example, retrieve one product at a time in preparation for examining the sales data of each product:
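The idea of fetching until a key value changes can also be sketched outside esProc. The following Python snippet is only an illustration, not esProc code: the sample records, the field layout (PID, date, quantity) and the helper name iter_groups are assumptions made for this example. It relies on the records already being sorted by PID, so that each product forms one consecutive run.

```python
from itertools import groupby

# Hypothetical sample records (PID, date, quantity), already sorted by PID.
records = [
    ("A1001", "2013-01-02", 5),
    ("A1001", "2013-01-05", 3),
    ("B1445", "2013-01-01", 7),
    ("B1445", "2013-01-03", 2),
    ("B1445", "2013-01-09", 4),
    ("C2000", "2013-01-04", 1),
]

def iter_groups(rows, key):
    """Yield (key value, rows) for each run of consecutive rows with the same key."""
    for k, run in groupby(rows, key=key):
        yield k, list(run)

# Retrieve one product at a time and total its quantity.
groups = list(iter_groups(records, key=lambda r: r[0]))
totals = {pid: sum(r[2] for r in rows) for pid, rows in groups}
```

Because groupby only merges adjacent rows, unsorted input would yield the same product more than once. The same caveat applies to any approach that reads until the key changes.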
As we know, the @z option can be used to retrieve data from a file or cursor by block. However, when retrieving by block, esProc decides how the data is divided, and this can sometimes cause trouble.
First, let's prepare the data: take the data used above, which is already sorted by product number, and store it in a new binary file, Order_Products:
At this point, a problem appears: the sales records for product number B1445 show up in both segments. If you aggregate after each retrieval, the same product number may appear more than once in the returned results, and a second aggregation is needed to get the final result. Such piecewise computation is quite common in parallel computing over big data, and this situation makes it more complicated. In this case, we should segment the data by group when storing it.
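The problem can be reproduced in a small Python sketch. This is not how esProc divides a file internally; it simply assumes a split by row count that ignores group boundaries. The sample rows and the helper names split_by_count and aggregate are hypothetical.

```python
from collections import Counter

# Hypothetical (PID, quantity) rows, sorted by PID.
rows = [
    ("A1001", 5), ("A1001", 3),
    ("B1445", 7), ("B1445", 2), ("B1445", 4),
    ("C2000", 1),
]

def split_by_count(rows, n_segments):
    """Split rows into blocks of nearly equal size, ignoring group boundaries."""
    size = -(-len(rows) // n_segments)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def aggregate(block):
    """Total the quantity per PID within one block."""
    totals = Counter()
    for pid, qty in block:
        totals[pid] += qty
    return totals

segments = split_by_count(rows, 2)
partials = [aggregate(seg) for seg in segments]

# B1445 straddles the cut, so it appears in both partial results
# and a second aggregation is needed to merge them.
final = sum(partials, Counter())
```

Here both partial results contain an entry for B1445, and only the merged result holds its true total.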
When writing binary data from a cursor, simply add the @g option. The data written to the file will then be segmented by group, so that all records of the same group are guaranteed to be retrieved together when the data is read by block. For example:
The data, sorted by product number, is saved as a binary file Order_Products_G and segmented by group according to PID. This differs slightly from the method used earlier to write the file Order_Products. Please note that segmenting by group during storage is only valid for binary files.
Now the result of retrieving by segment is different:
This time, segment 1 contains all the records whose product number is B1445, and segment 2 starts from the next product. As can be seen, if segmenting by group is specified when the binary file is written, all the data of a group is placed in a single segment when it is retrieved from the cursor. Segmenting by group guarantees the integrity of the data in each group and makes piecewise computation over big data simpler.
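The effect of segmenting by group can be illustrated with a short Python sketch. It is an assumption-based illustration, not esProc's actual storage algorithm: the helper split_by_group and the sample rows are hypothetical, and the rule used here is simply to cut only where the key changes once a segment has reached its target size.

```python
# Hypothetical (PID, quantity) rows, sorted by PID.
rows = [
    ("A1001", 5), ("A1001", 3),
    ("B1445", 7), ("B1445", 2), ("B1445", 4),
    ("C2000", 1),
]

def split_by_group(rows, n_segments, key):
    """Split sorted rows into at most n_segments blocks, cutting only where the key changes."""
    target = -(-len(rows) // n_segments)  # ceiling division
    segments, current = [], []
    for row in rows:
        # Cut only when the block is full, the key changes, and a segment is still available.
        if (len(current) >= target
                and key(row) != key(current[-1])
                and len(segments) < n_segments - 1):
            segments.append(current)
            current = []
        current.append(row)
    if current:
        segments.append(current)
    return segments

segments = split_by_group(rows, 2, key=lambda r: r[0])
pids_per_segment = [{pid for pid, _ in seg} for seg in segments]
```

With this sample, the first segment holds every A1001 and B1445 row and the second starts at C2000, so no product appears in two segments and the per-segment results can be combined without a second aggregation. The trade-off is that segments may be uneven in size.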