esProc Helps Process Structured Texts in Java – Handle Big Files in Groups

There is a type of text files that they are too big to be entirely loaded into the memory, yet as the data have been sorted by a certain column and if they are imported in groups according to this column, they can be all put into the memory for computing. These text files include the call detail record of a telecom company, statistics of visitors on a website, information of members of a shopping mall, etc.

A great deal of complicated code, which is difficult to maintain, is required if Java is used to realize the operation. But with the assistance of esProc in programming, it is easier for Java to deal with this kind of problems. Let’s look at how esProc works through an example.

The text file sOrder.txt has a lot of information of orders, with tab being the separator and the first row being the column names. The data have been sorted by SellerID. Now it is required to import a group of data with the same SellerID at a time, and process each group of data in the same way.

Part of the data of the file sOrder.txt is as follows:

esProc_java_structure_bigfilegroup_1

The esProc script is as follows:

esProc_java_structure_bigfilegroup_2

A1: cursor function opens the file as a cursor. tab is used as the separator by default and function option @t means the first row will be imported as the column names. If only the first four columns are imported and comma is used as the separator, the code should be written as cursor@t(OrderID, Client, SellerId, Amount; ”,”).

A2: for A1 ;SellerId. The code is to fetch data by loop from cursor A1. By fetching a group of data with the same SellerID each time, all the data will be stored in the memory. Please note the for statement here. for cs,n – a way of code writing in esProc – means fetching n records from cursor cs at a time. for cs;x means fetching a group of records from cursor cs at a time, with the same x field in each group and the data having been sorted by x in advance. In this example the data have been sorted, otherwise sortx function can be used to sort them.

Besides being a field, the x in the statement for cs;x can also be an expression. This means rows of data would be fetched each time until expression x changes, such as the line of code for A14 ;floor(SellerId/10), which groups together the data whose SellerID is from 0 to 9 and those whose SellerID is from 10 to 19, and in which floor function means getting the integer part. If each SellerID hasn’t many corresponding records, the above statement can fetch more data at once. Thus the computational performance will be increased.

B3-C3: This is the for statement’s loop body which performs the same operation of data processing on each group. The process of data processing is not our focus here. We’ll design a case like this: Compute the sales amount of each salesperson (SellerID), and append the sales record of the salesperson to the file result.txt if the sales amount is greater than 10,000.

One thing worth noting is that the working scope of for statements can be shown only by the indentation without being marked by braces or begin/end. Moreover, a loop variable can be represented by the name of the cell where for is placed. In this example, that means A2 is the records corresponding to the current SellerID, and A2.sum(Amount)represents performing summing up on the Amount field of this group of records. export function is used to export a group of data to a file and function option @a means the exportation will be realized by appending.

The computed results are saved in file result.txt. Some of the data are as follows:

esProc_java_structure_bigfilegroup_3

The above esProc script has done all the data processing. Then the rest of the work is to integrate the script with Java program through JDBC. Detailed code in Java is as follows:

  //create a connection using eProc JDBC

         Class.forName(“com.esproc.jdbc.InternalDriver”);

         con= DriverManager.getConnection(“jdbc:esproc:local://”);

         //call esProc script, whose name is test

         st =(com.esproc.jdbc.InternalCStatement)con.prepareCall(“call test()”);

         st.execute();//execute the esProc stored procedure

Note: This case doesn’t require returning the computed result to Java program. But sometimes we’ll append the computed result to a cell (say cell B2) and return it to Java program for further processing. In that case, a line of code should be added to the esProc script, like entering result B2 in cell A4, which means outputting the data in B2 to JDBC interface.

Thus the following Java program needs another line of code too to receive the returned result. The code will be written as ResultSet set = st.getResultSet(); after the execute.

Advertisements

About datathinker

a technical consultant on Database performance optimization, Database storage expansion, Off-database computation. personal blog at: datakeywrod, website: raqsoft
This entry was posted in esProc/R/Python/Perl, Structured Data Process and tagged , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s