esProc Assists Java to Query Big Text Files

Sometimes you need to query a big text file, instead of the database. In those cases, you need to retrieve the file in a stream style to perform the query algorithm, in which parallel processing is required in order to improve the computing performance. Java lacks the class library for doing this. Thus you have to use hardcoding to handle the processing of structured data, and, as a result, the code is complicated, unreadable, and inefficient in performing parallel processing.

You can use esProc (free version is now available) to make up for what Java lacks. Encapsulated with rich functions for reading and writing structured data and cursor function, esProc handles parallel processing with simple code. It also provides the easy-to-use JDBC interface. The Java application can identify an esProc script as the database to execute, pass parameters to it and get the result set via JDBC.

The structure of integration of esProc script and a Java application is as follows:

esProc_java_query_bigtextfile_1

An example will be used to explain the basic process of how esProc helps Java in querying the big text file. Below is the source data:

esProc_java_query_bigtextfile_2

To query orders whose dates are between startDate and endDate and whose amounts are greater than argAmount, use the following code:

esProc_java_query_bigtextfile_3

A1: Open the file with cursor. @t means importing the first row as column names.

A2: Perform structured query and the result is a cursor.

A3: Fetch data from cursor into memory. The result is as follows:

esProc_java_query_bigtextfile_4

The Java main program can call esProc script via JDBC, using the following code:

   Class.forName(“com.esproc.jdbc.InternalDriver”);

                    con= DriverManager.getConnection(“jdbc:esproc:local://”);

                    //Call esProc script (similar to the stored procedure); searchbig is the name of this script file

                    st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call searchbig()”);

                    //Set the parameter

                    st.setObject(1,”2010-01-01″);

                    st.setObject(2,”2010-12-31″);

                    st.setObject(3,2000);

                    //Execute the script

                    st.execute();

                    //Get the result set

                    ResultSet rs = st.getResultSet();

                            ……

The return value is JDBC standard ResultSet object. The method of calling an esProc script is the same as that of accessing a database. Programmers can master it fast as long as they are familiar with JDBC.

If the script is as simple as the above, you can write it right into the JDBC calling program, with lines of statements separated by \n. This is analogous to executing a complex SQL statement. It can save you the trouble of saving a script file.

st = (com. esproc.jdbc.InternalCStatement)con.createStatement();

ResultSet rs1 = st.executeQuery(“=file(\”D:\\sOrder.txt\”).import@t()\n” + “=A1.select(OrderDate>=date(\”2010-01-01\”) && OrderDate<=date(\”2010-12-31\”) && Amount>2000)\n”+=A2.fetch()”);

esProc will return the value of the last expression.

If the memory cannot hold the query result, you can return a cursor directly from the esProc script (i.e. delete A3’s code). You just need to set the number of rows to be fetched in each batch to enable the Java application to fetch data from the result set. The code is as follows: st.setFetchSize(1000) .

For more details about deploying esProc JDBC and calling script through it, see esProc Integration & Application: Java Invocation.

esProc also supports multithreaded parallel processing. The simplest way is using @m option with cursor function in the preceding code. The option means retrieving file with multiple threads.

Or you can segment the file manually to use multiple threads in both data retrieval and data computing. The code is as follows

esProc_java_query_bigtextfile_5

A1: Open the file with 8 cursors, each retrieving a specified part of the file. The sign ~ represents the loop variable, whose value is 1, 2…8 respectively. @z means dividing the file roughly into multiple segments by bytes and retrieving one of them each time. esProc will automatically skip the head row and make up the tail row to ensure that each row is retrieved completely.

A2: Query each of the cursors.

A3: Execute cursor queries in parallel and merge the results. @x means the objects of merge operation are cursors. @m means performing parallel computing. Note that conj function cannot guarantee the order consistency of the records in the result set and the original file.

The preceding code uses the esProc built-in function for parallel processing. In the cases where the algorithm is complicated, or where there are enough memory space for holding the result set, it is better to use explicit parallel statement. The code is as follows:

esProc_java_query_bigtextfile_6

A1: Set the number of parallel threads.

A2: Run your code in parallel. The parallel statement’s working range is the indented B2-B3. to(A1)=[1,2…8] represents the entry parameter for each thread. In the parallel statement, you can use A2 to get the entry parameter; outside it you can use A2 to get the results of all the threads.

B3: Query cursors and fetch the results into memory and return them to the main thread.

A4: Merge the results of threads in sequence.

You can use binary search to increase the query performance of handling sorted data. For example, data has been sorted by Client and OrderID, you need to find corresponding records according to parameters argClient and argOrder. To do this, use the following code:

esProc_java_query_bigtextfile_7

begin and end are beginning position and ending position specified for the binary search. m is the middle position.

B4: Locate the middle position by bytes and retrieve a record with cursor. esProc will automatically skip the head row and make up the tail row to retrieve a complete row. @x means closing the cursor as soon as the record is fetched.

B5-C6: If the locating is successful, store the current record in C5.

B7-C8: If the locating fails, continue to compare the related sets and reset begin and end.

A9: Return C5’s result to JDBC explicitly.

Advertisements

About datathinker

a technical consultant on Database performance optimization, Database storage expansion, Off-database computation. personal blog at: datakeywrod, website: raqsoft
This entry was posted in esProc/R/Python/Perl, Program Language, Structured Data Process and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s