esProc Helps Process Structured Texts in Java – Set Operations

Java doesn’t support set operations directly, so nested loops have to be used to realize the operations of intersection, union, complement and etc. between text files. If there are many text files, or the file to be computed is too big to be loaded into the memory, or it is required to perform set operations according to multiple fields, the code will become even more complicated. However, with the assistance of esProc, which supports set operations directly, Java can realize these operations more easily. Through the following example, you will learn in detail how esProc works.

There are two small files: f1.txt and f2.txt, with their first rows being the column names. You are required to get intersection of the Name fields of the two files. Part of the data is as follows:

File f1.txt:


File f2.txt:


esProc code:


Import the two files into the memory using import function, with tab being the separator by default. Here function option @t means importing the first row as the column names, thus the subsequent computation can use Name and Dept directly to reference the corresponding columns. If the first row is not the column names, then the default column names such as _1 and _2 should be used to make the reference.

After computing, values of A1 and B1 are respectively as follows:


import function can import specified columns. In this case, for instance, only Name field will be computed, so you can merely import this column. The code for importing it is file(“E:\\f1.txt”).import@t(Name).

A2: isect function performs intersection operation. A1.(Name) means drawing out A1’s Name field to create a set, B1.(Name) means drawing out B1’s Name field. The final result of this example is as follows:


A3: The code result A2 means outputting the computed result of A2 to JDBC. You can also get the final result by combining A2 and A3 into one step, with the code being written as result [A1.(Name),B1.(Name)].isect().

The above is the process of intersection operation. For union operation, you just change the function in the code, that is, [A1.(Name),B1.(Name)].union(). The computed result is as follows:


The code for complement operation is [A1.(Name),B1.(Name)].diff(). The computed result is as follows:


The concatenation operation, which is equal to union operation with duplicates retained, is special. The code for it is [A1.(Name),B1.(Name)].conj(). The computed result is as follows:


You can also replace the functions with operators to make the code simpler. Thus the above code for operations of intersection, union, complement and concatenation can be rewritten as:

A1.(Name) ^ B1.(Name)

A1.(Name) & B1.(Name)

A1.(Name) \ B1.(Name)

A1.(Name) | B1.(Name)

Or you can perform set operations on multiple files. For instance, the corresponding variables of f1.txt, f2.txt and f3.txt after they are imported into the memory are A1, B1 and C1 respectively. The code for getting their intersection is as follows:

A1.(Name) ^ B1.(Name) ^C1.(Name) 或 [A1.(Name),B1.(Name),C1.(Name)].isect()

Sometimes the file is so big that it would degrade performance. In this occasion, sort function can be used to sort data first, and then perform set operations using merge function. In this way, the performance will be improved significantly. Function options @i, @u and @d need to be used respectively in performing operations of intersection, union and complement. Their corresponding code is respectively as follows:




merge function can also be used in performing set operations on multiple fields. Suppose different Dept corresponds to the same Name, now it is required to perform intersection operation by taking the two fields as a whole. The code will be written as:


The computed result is as follows:


For files that are too big to be loaded into the memory, cursor function can be used to import them and merge function to perform set operations on them. For example, the code for intersection operation is as follows:

    A2=[ A1.sortx(Name),B1.sortx(Name)].merge@xi(Name) 

Note: cursor function won’t import all the data into the memory at once. It opens a file in the form of cursors (or flow). esProc engine will allocate appropriate buffer for each imported batch of data. The operation thus runs until it is completed.

Different from in-memory computing, the use of cursor requires cursor functions. For instance, sortx function should be used for sorting. Here merge function uses two function options: @i, which means performing intersection operation, and @x, which means the object involved in the computation is cursor instead of in-memory data. Besides, functions like union can only perform set operations on in-memory data; hence they cannot be used to deal with big files.

The above esProc script has completed all data processing. The rest of the work is to integrate the script with Java program through JDBC. Java code is as follows:

    // create a connection using esProc JDBC


         con= DriverManager.getConnection(“jdbc:esproc:local://”);

         // call esProc script, whose name is test

         st =(com.esproc.jdbc.InternalCStatement)con.prepareCall(“call test()”);

         st.execute();//execute esProc stored procedure

         ResultSet set = st.getResultSet();//get the result

File f1.txt:


About datathinker

a technical consultant on Database performance optimization, Database storage expansion, Off-database computation. personal blog at: datakeywrod, website: raqsoft
This entry was posted in esProc/R/Python/Perl, Structured Data Process and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s