esProc Assists Java to Retrieve Text Files

Java provides functions for handling the basic file processing, which refers to the retrieval of small text files, in a simple, unstructured way. But in handling files requiring structured format, holding data of various formats and having particular requirements, or big files that cannot be entirely loaded into memory, Java code is quite complicated and its readability and reusability are hard to be guaranteed.

esProc (free version is available) can be used to make up for the these deficiencies. esProc encapsulates a lot of functions for reading in/writing out and processing structured data, and provides the JDBC interface. A Java application will identify esProc script as a database stored procedure to execute, pass parameters to it and get result set via JDBC.

The structure of integration of esProc script and the Java application is as follows:

esProc_java_retrieve_text_1

Now we’ll take the retrieval of specified columns of data as an example to explain how esProc helps Java in retrieving text files. Below is the source data:

esProc_java_retrieve_text_2

To import the OrderID, Client and Amount column, use the following esProc code:

esProc_java_retrieve_text_3

The result:

esProc_java_retrieve_text_4

1. @t means importing the first row as column names. If there are no column names, you can use their sequence numbers to reference columns. To import the first, the second and the fourth column, for example, use file(“D: \\sOrder.txt”).import(#1,#2,#4). The result is as follows:

esProc_java_retrieve_text_5

2. You can also return a computed column. For example, use the following code to combine OrderDate and OrderID into the newOrderID and return it along with Client and Amount:

esProc_java_retrieve_text_6

By default import function reads in all fields. new function creates a two-dimensional table. The result is:

esProc_java_retrieve_text_7

  1. The default separator is tab. Or you can use other separators. To import a CSV file separated with commas, for example, use file(“D: \\sOrder.txt”).import@t(;”,”).
  2. To return some of the rows, specify them by row numbers. For example, use A1.to(2,100) to return rows from the second to the hundredth; and use A1.to(3,) to return rows beginning from the third one.
  3. In a few cases, columns of data need to be retrieved into one column. For example, to concatenate OrderID, Client and Amount and export them as one column, you can use the following code after data importing:

create(all).record(A1.(OrderID)|A1.(Client)|A1.(Amount))

Invocation of esProc script by Java:

The Java main program can invoke esProc script via JDBC, using the following code:

Class.forName(“com.esproc.jdbc.InternalDriver”);

con= DriverManager.getConnection(“jdbc:esproc:local://”);

//Call esProc script (similar to the stored procedure); readfile is the name of this script file

st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call readfile()”);

//Execute the script

st.execute();

//Get the result set

ResultSet rs = st.getResultSet();

……

The return value is JDBC standard ResultSet object. The method of calling an esProc script is the same as that of accessing a database. Programmers can master it fast as long as they are familiar with JDBC.

If the script is as simple as the above, you can write it right into the JDBC calling program, with lines of statements separated by \n. This is analogous to executing a complex SQL statement. It can save you the trouble of saving a script file.

st = (com. esproc.jdbc.InternalCStatement)con.createStatement();

ResultSet rs1 = st.executeQuery(“=file(\”D:\\sOrder.txt\”).import@t(OrderID,Client,Amount)”);

esProc will return the value of the last expression.

For more details about deploying esProc JDBC and calling script through it, see esProc Integration & Application: Java Invocation.

As a specialized class library for structured computing, esProc can handle more than this. Examples will show its capabilities.

Retrieving big files

To retrieve a big file that exceeds the memory capacity, use an esProc cursor. Java application can be accessed via JDBC stream.

esPro code:

esProc_java_retrieve_text_8

JAVA code:

Class.forName(“com.esproc.jdbc.InternalDriver”);

con= DriverManager.getConnection(“jdbc:esproc:local://”);

//Call esProc script (similar to the stored procedure); readbigfile is the name of this script file

st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call readbigfile ()”);

//Set the number of rows in each batch that is retrieved

st.setFetchSize(1000)

//Execute the script

st.execute();

//Get the result set

ResultSet rs = st.getResultSet();

while (rs.next()) {

……

}

  1. To accelerate the file retrieval, you can use multithreaded parallel processing through @m option. The code is =file(“D: \\sOrder.txt”).cursor@tm(OrderID,Client,Amount). But this approach cannot guarantee that data is retrieved in its original order.
  2. Sometimes you need to segment data manually before putting it to be processed in parallel. To read in a segment of data, use file(“D:\\sOrder.txt”).import@z@t(;,2:24). @z means dividing the file roughly into 24 segments by bytes and importing the second one only. esProc will automatically skip the head row and make up the tail row to ensure that each row is retrieved completely. If the data size in each segment still exceeds the memory capacity, you can replace the import function with cursor function to export data as a cursor.

Retrieving file by column lengths

The following data.txt file does not use the separator:

esProc_java_retrieve_text_9

You are required to retrieve the file into a four-column two-dimensional table according to specified column lengths and return it to Java. The id column will have the first three bits, flag column will have the 10th and 11th bits, d1 column will have bits from the 14th to the 24th, and d2 column will have bits from the 25th to 33rd. Thus the four columns in the first row will be 001, DT, 100000000000 and 3210XXXX.

esProc code:

esProc_java_retrieve_text_10

A1: @i means returning a sequence (set) if the file has only one column.

A2: Create a two-dimensional table based on A1. mid function truncates a string and ~ represents each row.

The result is:

esProc_java_retrieve_text_11

Retrieving file containing special characters

The data.csv file contains quotation marks, some of which disrupt the use of the data. So you need to remove the quotation marks and then return the file to Java: Below is the source data:

esProc_java_retrieve_text_12

esProc code:

esProc_java_retrieve_text_13

The result is:

esProc_java_retrieve_text_14

Retrieving file containing mathematical formulas

In this case, you need to parse the mathematical formulas into expressions, evaluate them and then return the results. Below is the source data:

esProc_java_retrieve_text_15

esProc code:

esProc_java_retrieve_text_16

eval function dynamically parses strings into expressions to execute.

The result is:

esProc_java_retrieve_text_17

Retrieving file with multi-line records

In the following file, each record includes three lines. For example, the first record is JFS    3       468.0        2009-08-13 39. Now you need to export the file into a two-dimensional table.

esProc_java_retrieve_text_18

esProc code:

esProc_java_retrieve_text_19

A1: Import the file as a sequence. @s means not splitting the field.

A2: Group the sequence every three members. “#” represents the row number; “\” means integer division. The result is:

esProc_java_retrieve_text_20

A3: Create a table sequence based on A2. ~(1) is the first member of the current group. array function splits a string into a sequence.

esProc_java_retrieve_text_21

If the file is too big to be entirely loaded into memory, you need to use cursor to retrieve it and perform batch processing. First create a sub.dfx, which responses the external request of data retrieval by retrieving a batch of data and return it. This operation repeats until the whole file is retrieved. Below is the esProc code:

esProc_java_retrieve_text_22

A2: Loop through A1 and retrieve 3,000 rows in each loop. The number of rows is the multiple of 3 and the data should be able to be held by memory. The working range of the loop statement is B2-B4, in which A2 is used to reference a loop variable.

B2-B3: Arrange data in the current batch into a two-dimensional table.

B4: Return B3 to the main script.

The main script – which is the dfx file to be called by Java – is as follows:

esProc_java_retrieve_text_23

pcursor function requests retrieving data through sub.dfx and convert data to a cursor and export it.

Retrieving records from uncertain lines

With the data.txt file, field values in a record scatter in uncertain number of lines. But fields are fixed. They are “Object Type”, “left”, “top” and “Line Color” and appear repeatedly until the end of the file. The first record, for example, is Symbol1, 14, 11 and RGB( 1 0 0 ). Now you need to retrieve the file into a structured two-dimensional table.

esProc_java_retrieve_text_24

esProc code:

esProc_java_retrieve_text_25

A1: Read in the file as a big string.

A2: Split the string with the separator and remove the first empty line. The result is:

esProc_java_retrieve_text_26

A3: Create a table sequence and use string functions – array, pos, len, mid – to find the desired fields. Note that you should use an if statement to judge the last line, for maybe no carriage return is used there. The final result is:

esProc_java_retrieve_text_27

Besides the string functions, you can use regular expressions to find the desired fields.

To handle a big file that cannot be loaded into memory in one go, use pcursor function to retrieve it in batches.

Retrieving records by marked groups

The data.txt file stores records by groups. Group names are marked by list (such as ARO, BDR, and BSF). You need to combine group names with their field values and export the records. Below is the source data:

esProc_java_retrieve_text_28

esProc code:

esProc_java_retrieve_text_29

A1: Import the file into a sequence of strings.

A2: Group the sequence according to lines headed by list. @o forbids sorting. The sign * represents the wildcard character. The result is:

esProc_java_retrieve_text_30

A3: Put every two groups in A2 together. The result is:

esProc_java_retrieve_text_31

A4: Form a record by combining data in a group. The result is:

esProc_java_retrieve_text_32

A5: Finally concatenate the records.

Advertisements

About datathinker

a technical consultant on Database performance optimization, Database storage expansion, Off-database computation. personal blog at: datakeywrod, website: raqsoft
This entry was posted in Application, Program Language and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s