An Illustration of esProc’s Parallel Processing of Big Text Files

esProc makes it convenient to process big text files in parallel. The following case illustrates how.

Assume a text file, sales.txt, holds ten million sales records whose main fields are SellerID, OrderDate and Amount. The task is to compute each salesman’s total Amount from big orders over the past four years, where a big order is one whose amount is above 2000.

To process the file in parallel, we must first segment it. esProc provides the cursor data object and its functions, which make big text files easy to segment and read. Take file("e:/sales.txt").cursor@tz(;, 3:24) as an example: it divides the file into 24 roughly equal segments by bytes and reads the third one. Segmenting naively by bytes can cut a line in the middle, leaving half a record that needs extra programming to handle. Segmenting by rows avoids this, but locating a segment requires traversing all the rows before it, which cancels out the efficiency that segmentation and parallel processing are meant to deliver. esProc takes neither route: it segments by bytes and automatically rounds each segment to whole lines, so every record read is valid.
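The rounding idea can be sketched outside esProc. The Python below is only an illustration under stated assumptions, not esProc's implementation: `segment_bounds` and `read_segment` are hypothetical helper names, and the rule used (a line belongs to the segment in which its first byte falls) is one common way to make byte segments line-aligned.

```python
import os

def segment_bounds(size, k, n):
    """Byte range [start, end) of the k-th (1-based) of n roughly equal segments."""
    return size * (k - 1) // n, size * k // n

def read_segment(path, k, n):
    """Return the complete lines that belong to segment k of n.

    Assumed rule: a line belongs to the segment containing its first byte,
    so across all segments every line is read exactly once.
    """
    size = os.path.getsize(path)
    start, end = segment_bounds(size, k, n)
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard up to the next newline. This
            # skips a partial line, but keeps a line starting exactly at `start`.
            f.seek(start - 1)
            f.readline()
        # Read whole lines as long as the line's first byte lies before `end`.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\r\n").decode("utf-8"))
    return lines
```

Calling `read_segment("sales.txt", 3, 24)` would play the role of the `3:24` argument above: no earlier rows are traversed, only one seek and one discarded partial line are needed.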

Once the file can be segmented this way, the parallel processing itself is simple. The code is as follows:

Main Program

[Figure: esProc_parallism_bigtextfile_1, the main program grid, cells A1 to A4]

    A1: Set the number of parallel tasks to 24; that is, divide the file into 24 segments.

    A2: Call the subprogram to perform multithreaded parallel computation. There are two task parameters: to(A1) and A1. The value of to(A1) is [1,2,3…24], the serial numbers of the segments; A1 is the total number of segments. When all the tasks are done, their results are collected in A2.

    A3: Merge the results of the tasks in A2 by SellerID.

    A4: Group and summarize the merged results to get each salesman’s total amount.
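For readers who do not use esProc, the dispatch-and-combine shape of these four steps can be approximated in Python. This is a sketch under assumptions, not a translation of the grid: `run_tasks` and `combine` are hypothetical names, each task is assumed to return a `{SellerID: amount}` dictionary, and a thread pool stands in for callx.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def run_tasks(task, total, workers=8):
    """Call task(segment, total) for segment = 1..total; return results in segment order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Analogue of passing to(A1) and A1 as the two task parameters.
        return list(pool.map(lambda seg: task(seg, total), range(1, total + 1)))

def combine(partials):
    """Sum the per-task {SellerID: amount} dictionaries into one total per seller."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # Counter.update adds values for matching keys
    return dict(sorted(totals.items()))
```

In CPython, threads do not speed up CPU-bound parsing because of the global interpreter lock; swapping in `ProcessPoolExecutor` with a module-level task function would be the usual choice for real workloads.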

Subprogram

 

[Figure: esProc_parallism_bigtextfile_2, the subprogram grid, cells A1 to A4]

The subprogram has two parameters, segment and total, which represent the current segment number and the total number of segments respectively. In task 3, for example, segment is 3 and total is 24, as it is in every task.

    A1: Read the file with a cursor. The parameters passed from the main program determine which segment the current task processes.

    A2: Select the records dated after 2011 whose order amount is above 2000.

    A3: Group and summarize the filtered data.

    A4: Return the task’s result to the main program.
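The filter-then-group work of one task can be illustrated with a small Python function. It is a sketch only: `summarize_segment` is a hypothetical name, and the record layout (comma-separated `SellerID,OrderDate,Amount` with an ISO date and no header row) is an assumption made for the example, not the format of sales.txt.

```python
from collections import defaultdict
from datetime import date

def summarize_segment(lines, min_amount=2000.0, first_year=2012):
    """Total the Amount of big, recent orders per SellerID for one segment.

    Assumes each line looks like 'SellerID,OrderDate,Amount', e.g. '7,2013-05-02,2500'.
    """
    totals = defaultdict(float)
    for line in lines:
        seller_id, order_date, amount = line.split(",")
        amount = float(amount)
        # Keep orders above the threshold that were placed after 2011.
        if amount > min_amount and date.fromisoformat(order_date).year >= first_year:
            totals[int(seller_id)] += amount
    # Sorted by SellerID so the caller can merge partial results in order.
    return sorted(totals.items())
```

Each task returns only one row per seller, so what travels back to the main program is far smaller than the segment that was read.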

Code description

For a CPU with N cores, it seems natural to set N tasks. In practice, some tasks always run faster than others, for instance because the filter keeps a different amount of data in each segment. The usual outcome is that some cores sit idle after finishing their faster tasks while a few others are still working on slower ones. If each core instead performs several tasks in turn, the differences in task speed average out and the overall running time becomes more stable. That is why the example divides the work into 24 tasks and gives them to the CPU’s 8 cores (the number of threads allowed to run at the same time is configured in the esProc environment). On the other hand, too many tasks have drawbacks of their own. One is a decline in overall performance; the other is that the results returned by the tasks add up to more data, which occupies more memory.
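The load-balancing argument can be checked with a toy simulation. The durations below are made up for illustration (one slow segment among seven fast ones, each assumed to split evenly into three smaller tasks), and `makespan` is a hypothetical helper that models idle cores picking up the next queued task.

```python
import heapq

def makespan(durations, cores):
    """Finish time when each core, once idle, takes the next task from the queue."""
    finish = [0.0] * cores
    heapq.heapify(finish)
    for d in durations:
        # The core that becomes free first picks up the next task.
        heapq.heappush(finish, heapq.heappop(finish) + d)
    return max(finish)

# 8 tasks on 8 cores: one slow segment (30 time units) and seven fast ones (6 each).
coarse = [30] + [6] * 7
# The same work as 24 tasks: every segment split into three equal parts.
fine = [10] * 3 + [2] * 21

print(makespan(coarse, 8))  # 30.0
print(makespan(fine, 8))    # 10.0
```

With the same total work of 72 units, the finer split finishes close to the ideal 72 / 8 = 9 units, while the coarse split leaves seven cores idle for most of the run. The simulation ignores per-task overhead, which is what eventually penalizes very fine splits.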

callx simplifies development by encapsulating the complex multithreaded computation, so programmers can focus on the business algorithm instead of being distracted by complicated semaphore control.

The results in A3 of the main program are already sorted by SellerID, so there is no need to sort again when grouping and summarizing in A4. @o, an option of the groups function, groups and summarizes efficiently without sorting.
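The same idea, merging sorted partial results and then grouping adjacent rows without a further sort, can be shown in Python. This is an analogue rather than esProc's mechanism: `merge_and_total` is a hypothetical name, and each task is assumed to return a list of `(SellerID, amount)` pairs already sorted by SellerID.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_and_total(partials):
    """Merge per-task (SellerID, amount) lists, each sorted by SellerID, and sum per seller."""
    # heapq.merge streams the inputs in key order without sorting the whole set again.
    merged = heapq.merge(*partials, key=itemgetter(0))
    # groupby only joins adjacent equal keys, which is sufficient for sorted input.
    return [(seller, sum(amount for _, amount in rows))
            for seller, rows in groupby(merged, key=itemgetter(0))]
```

Because equal SellerIDs are guaranteed to be adjacent after the merge, a single pass is enough, which is the property the no-sort grouping relies on.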

Further illustration:

When a text file grows to several TB, parallel computation has to be spread over multiple nodes in a cluster. esProc handles this case easily as well, because its cursor and related functions support inexpensive scale-out and distributed file systems (DFS). For the above example, all that is needed is a node list added to A2 in the main program, like this: =callx("sub.dfx", to(A1),A1; ["192.168.1.200:8281", "192.168.1.201:8281", "……"]).

About datathinker

A technical consultant on database performance optimization, database storage expansion, and off-database computation. Personal blog: datakeywrod; website: raqsoft.