How esProc Assists Java to Retrieve Text Files

Java provides functions for basic file processing, that is, reading small text files in a simple, unstructured way. But when a file must be read as structured data, holds data in various formats, has special requirements, or is too big to be loaded into memory at once, the Java code becomes complicated and its readability and reusability are hard to guarantee.

esProc (a free edition is available) can make up for these deficiencies. It encapsulates many functions for reading, writing and processing structured data, and provides a JDBC interface. A Java application treats an esProc script as a database stored procedure: it executes the script, passes parameters to it and gets the result set via JDBC. You can learn more from How to Use esProc as the Class Library for Java.
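
The call pattern described above can be sketched in Java as follows. This is only an illustration: the driver class name com.esproc.jdbc.InternalDriver, the URL jdbc:esproc:local:// and the script name readOrders are assumptions that are not given in this article, so check them against the esProc documentation for your edition before relying on them.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;

class EsprocCall {
    // Builds a stored-procedure style call string such as "call readOrders(?,?)".
    static String buildCall(String scriptName, int paramCount) {
        String marks = String.join(",", Collections.nCopies(paramCount, "?"));
        return "call " + scriptName + "(" + marks + ")";
    }

    // Runs the script and prints the first column of every returned row.
    // ASSUMPTION: driver class and URL below; the esProc JDBC jar must be on the classpath.
    static void run(String scriptName, Object... params)
            throws SQLException, ClassNotFoundException {
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             CallableStatement st = con.prepareCall(buildCall(scriptName, params.length))) {
            for (int i = 0; i < params.length; i++) {
                st.setObject(i + 1, params[i]); // JDBC parameters are 1-based
            }
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getObject(1));
                }
            }
        }
    }
}
```

A caller would use something like EsprocCall.run("readOrders", "D:\\sOrder.txt"), where readOrders is a hypothetical script that takes the file path as its parameter.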

The following are cases you frequently encounter when retrieving text files in Java, together with their esProc solutions.

Retrieving specified fields

You need to import the OrderID, Client and Amount columns by their names. Below is the source data:

esProc_java_retrieve_text_2

esProc code:

esProc_java_retrieve_text_3

The result:

esProc_java_retrieve_text_4

1. @t means importing the first row as column names. If there are no column names, you can reference columns by their sequence numbers. To import the first, second and fourth columns, for example, use file("D:\\sOrder.txt").import(#1,#2,#4). The result is as follows:

esProc_java_retrieve_text_5

2. You can also export a computed column. For example, use the following code to combine the year and OrderID into a new column, newOrderID, and export it along with Client and Amount:

esProc_java_retrieve_text_6

By default, the import function reads in all fields. The new function creates a two-dimensional table. The result is:

esProc_java_retrieve_text_7

  1. The default separator is tab, but you can specify another one. To import a comma-separated CSV file, for example, use file("D:\\sOrder.txt").import@t(;",").
  2. To export only some of the rows, specify them by row numbers. For example, use A1.to(2,100) to export rows 2 through 100, and use A1.to(3,) to export rows from the third one to the end.
  3. In a few cases, several columns of data need to be retrieved and exported as one column. For example, to concatenate OrderID, Client and Amount and export them as one column, you can use the following code after importing the data:

create(all).record(A1.(OrderID)|A1.(Client)|A1.(Amount))
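
For comparison, here is a rough plain-Java sketch of what selecting columns by name from a separated text file involves. It is not how esProc implements import; the class name ColumnSelect and the sample rows in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ColumnSelect {
    // Keeps only the named columns from separated lines whose first line is the header.
    static List<String[]> select(List<String> lines, String sep, String... names) {
        String splitter = Pattern.quote(sep); // treat the separator literally, not as a regex
        List<String> header = Arrays.asList(lines.get(0).split(splitter, -1));
        int[] idx = new int[names.length];
        for (int i = 0; i < names.length; i++) {
            idx[i] = header.indexOf(names[i]);
            if (idx[i] < 0) {
                throw new IllegalArgumentException("No such column: " + names[i]);
            }
        }
        List<String[]> rows = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split(splitter, -1);
            String[] row = new String[idx.length];
            for (int i = 0; i < idx.length; i++) {
                row[i] = cells[idx[i]];
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Calling ColumnSelect.select(lines, "\t", "OrderID", "Client", "Amount") on lines read from a tab-separated file returns one String array per data row. Type conversion, computed columns and row ranges would all need further code.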

Retrieving big files

To retrieve a big file that exceeds the memory capacity, use an esProc cursor, which a Java application can access as a JDBC stream.

esProc code:

esProc_java_retrieve_text_8

  1. To accelerate file retrieval, you can use multithreaded parallel processing through the @m option. The code is =file("D:\\sOrder.txt").cursor@tm(OrderID,Client,Amount). But this approach cannot guarantee that data is retrieved in its original order.
  2. Sometimes you need to segment data manually before processing it in parallel. To read in a segment of data, use file("D:\\sOrder.txt").import@z@t(;,2:24). @z means dividing the file roughly into 24 segments by bytes and importing the second one only. esProc will automatically skip the partial head row and complete the tail row to ensure that each row is retrieved in full. If the data in each segment still exceeds the memory capacity, you can replace the import function with the cursor function to return the data as a cursor.
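
The idea of byte-based segmentation with whole-row boundaries can be sketched in plain Java as below. This illustrates the general technique only, not esProc's actual implementation. It works on an in-memory byte array for brevity (a real version would seek within the file), assumes rows end with '\n', and the class name SegmentReader is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SegmentReader {
    // Returns the complete rows whose first byte falls inside segment k of n (k is 1-based).
    static List<String> readSegment(byte[] data, int k, int n) {
        long len = data.length;
        int start = (int) ((k - 1) * len / n);
        int end = (int) (k * len / n);
        int pos = start;
        if (start > 0) {
            // Skip the partial head row: move to the byte after the nearest line break
            // at or after start - 1. That row belongs to the previous segment.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        List<String> rows = new ArrayList<>();
        while (pos < end) {
            // Complete the tail row: read to its line break even past the segment end.
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            rows.add(new String(data, pos, eol - pos, StandardCharsets.UTF_8));
            pos = eol + 1;
        }
        return rows;
    }
}
```

Because a row is assigned to the segment that contains its first byte, reading all n segments yields every row exactly once, so the segments can be processed by separate threads.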

Retrieving file by column lengths

The following data.txt file does not use a separator:

esProc_java_retrieve_text_9

You need to retrieve the file into a four-column two-dimensional table according to specified column lengths and return it to Java. The id column takes the first three characters, the flag column the 10th and 11th characters, the d1 column the 14th to the 24th characters, and the d2 column the 25th to the 33rd characters. Thus the four columns in the first row will be 001, DT, 100000000000 and 3210XXXX.

esProc code:

esProc_java_retrieve_text_10

@i means returning a sequence (set) if the file has only one column. Then create a two-dimensional table based on A1; the mid function extracts a substring and ~ represents each row.

Here’s the result:

esProc_java_retrieve_text_11
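
For comparison, cutting a line by character positions in plain Java might look like the sketch below. The class name FixedWidth is hypothetical, and the positions are passed in as 1-based inclusive ranges to match the description above.

```java
class FixedWidth {
    // Cuts a line into fields given 1-based inclusive character ranges {from, to}.
    static String[] cut(String line, int[][] ranges) {
        String[] out = new String[ranges.length];
        for (int i = 0; i < ranges.length; i++) {
            int from = ranges[i][0] - 1;                     // convert to a 0-based start
            int to = Math.min(ranges[i][1], line.length());  // tolerate short lines
            out[i] = from < to ? line.substring(from, to) : "";
        }
        return out;
    }
}
```

With the ranges from this section, the call would be FixedWidth.cut(line, new int[][] {{1, 3}, {10, 11}, {14, 24}, {25, 33}}), applied to each line of the file in turn.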

Retrieving file containing special characters

The data.csv file contains quotation marks, some of which disrupt the use of the data, so you need to remove the quotation marks before returning the file to Java. Below is the source data:

esProc_java_retrieve_text_12

esProc code:

esProc_java_retrieve_text_13

Here’s the result:

esProc_java_retrieve_text_14

Retrieving the file containing mathematical formulas

In this case, you need to parse the mathematical formulas into expressions, evaluate them and return the results. Below is the source data:

esProc_java_retrieve_text_15

esProc code:

esProc_java_retrieve_text_16

The eval function dynamically parses a string into an expression and executes it.

The result is:

esProc_java_retrieve_text_17
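
The Java standard library offers no direct counterpart to eval for arithmetic strings, so doing the same in plain Java means writing or importing a parser. The sketch below is a small recursive-descent evaluator, included only to show the kind of code involved. It assumes the formulas use decimal numbers, + - * / and parentheses, which may not cover the formulas in the source file, and the class name Formula is hypothetical.

```java
class Formula {
    private final String s;
    private int pos;

    private Formula(String text) {
        this.s = text.replaceAll("\\s+", ""); // ignore whitespace
    }

    // Evaluates an arithmetic expression such as "(1+2)*3".
    static double eval(String text) {
        Formula f = new Formula(text);
        double v = f.expr();
        if (f.pos != f.s.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + f.pos);
        }
        return v;
    }

    // expr := term (('+' | '-') term)*
    private double expr() {
        double v = term();
        while (pos < s.length() && (s.charAt(pos) == '+' || s.charAt(pos) == '-')) {
            char op = s.charAt(pos++);
            double r = term();
            v = (op == '+') ? v + r : v - r;
        }
        return v;
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double v = factor();
        while (pos < s.length() && (s.charAt(pos) == '*' || s.charAt(pos) == '/')) {
            char op = s.charAt(pos++);
            double r = factor();
            v = (op == '*') ? v * r : v / r;
        }
        return v;
    }

    // factor := number | '(' expr ')' | '-' factor
    private double factor() {
        if (pos < s.length() && s.charAt(pos) == '-') {
            pos++;
            return -factor();
        }
        if (pos < s.length() && s.charAt(pos) == '(') {
            pos++;
            double v = expr();
            if (pos >= s.length() || s.charAt(pos) != ')') {
                throw new IllegalArgumentException("Missing ) at position " + pos);
            }
            pos++;
            return v;
        }
        int start = pos;
        while (pos < s.length() && (Character.isDigit(s.charAt(pos)) || s.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Number expected at position " + pos);
        }
        return Double.parseDouble(s.substring(start, pos));
    }
}
```

Each line of the file would be passed to Formula.eval in turn. Variables, functions and exponentiation are outside the scope of this sketch.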

Retrieving file with multi-line records

In the following file, each record includes three lines. For example, the first record is JFS    3       468.0        2009-08-13 39. Now you need to export the file into a two-dimensional table.

esProc_java_retrieve_text_18

esProc code:

esProc_java_retrieve_text_19

First import the file as a sequence; @s means not splitting the fields. Then group the sequence every three members; "#" represents the row number and "\" means integer division. Here’s the result:

esProc_java_retrieve_text_21
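
The grouping step relies on integer division: lines that share the same quotient when their zero-based position is divided by three belong to the same record. A plain-Java sketch of that idea follows. The class name MultiLineRecords is hypothetical, and it assumes fields within a line are tab-separated, which the article does not state.

```java
import java.util.ArrayList;
import java.util.List;

class MultiLineRecords {
    // Joins every `size` consecutive lines into one record and splits it on tabs.
    // A trailing group with fewer than `size` lines is dropped.
    static List<String[]> group(List<String> lines, int size) {
        List<String[]> records = new ArrayList<>();
        for (int i = 0; i + size <= lines.size(); i += size) {
            // Lines i .. i+size-1 all have the same value of (index / size).
            String joined = String.join("\t", lines.subList(i, i + size));
            records.add(joined.split("\t"));
        }
        return records;
    }
}
```

MultiLineRecords.group(lines, 3) then returns one String array per record, with the fields in the order they appear across the three lines.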

If the file is too big to be entirely loaded into memory, you need to use a cursor to retrieve it and process it in batches. First create a sub.dfx, which responds to an external request for data by retrieving a batch of data and returning it. This operation repeats until the whole file is retrieved. Below is the esProc code:

esProc_java_retrieve_text_22

Loop through A1, retrieving 3,000 rows in each iteration, and then process each batch with the same algorithm as above. B4 returns B3 to the main script. The main script (the dfx file to be called by Java) is as follows:

esProc_java_retrieve_text_23

The pcursor function requests data through sub.dfx, converts the returned data to a cursor and exports it.

Retrieving records from uncertain lines

In the data.txt file, the field values of a record are scattered over an uncertain number of lines, but the fields themselves are fixed: "Object Type", "left", "top" and "Line Color". They appear repeatedly until the end of the file. The first record, for example, is Symbol1, 14, 11 and RGB( 1 0 0 ). Now you need to retrieve the file into a structured two-dimensional table.

esProc_java_retrieve_text_24

esProc code:

esProc_java_retrieve_text_25

The read function reads in the file as one big string, which is then split by the separator, and the first empty line is removed. Finally create a table sequence and use string functions (array, pos, len, mid) to find the desired fields. Note that you should use an if statement to check for the last line, because it may not end with a carriage return. Here’s the final result:

esProc_java_retrieve_text_27

Besides the string functions, you can use regular expressions to find the desired fields.
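
As an illustration of the regular-expression approach, here is a sketch written in Java. The layout it expects (each field written as a name, then a colon or equals sign, then the value) is an assumption, because the exact format of data.txt is not reproduced here, and the class name FieldScanner is hypothetical. The pattern would need adjusting to the real file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldScanner {
    // ASSUMED layout: "Object Type: Symbol1", "left: 14", "top: 11", "Line Color: RGB( 1 0 0 )",
    // with any number of unrelated lines in between. DOTALL lets ".*?" span line breaks.
    private static final Pattern RECORD = Pattern.compile(
            "Object Type\\s*[:=]\\s*(\\S+).*?"
            + "\\bleft\\s*[:=]\\s*(\\d+).*?"
            + "\\btop\\s*[:=]\\s*(\\d+).*?"
            + "Line Color\\s*[:=]\\s*(RGB\\([^)]*\\))",
            Pattern.DOTALL);

    // Returns one {type, left, top, color} array per record found in the text.
    static List<String[]> scan(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            rows.add(new String[] {m.group(1), m.group(2), m.group(3), m.group(4)});
        }
        return rows;
    }
}
```

Because the whole text is matched as one string, the last record is found whether or not the file ends with a line break.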

To handle a big file that cannot be loaded into memory in one go, use the pcursor function to retrieve it in batches.

Retrieving records by marked groups

The data.txt file stores records in groups. Group names (such as ARO, BDR and BSF) are marked by list. You need to combine the group names with their field values to form records and export them. Below is the source data:

esProc_java_retrieve_text_28

esProc code:

2015-09-14_185019

First import the file into a sequence of strings, and then group the sequence according to the lines headed by list. @i starts a new group whenever the specified condition is satisfied, and the * sign is a wildcard character. Here’s A2’s result:

esProc_java_retrieve_text_31

And then retrieve the desired fields and concatenate the records from each group. Here’s the result:

esProc_java_retrieve_text_32
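
The same grouping logic can be sketched in plain Java as a single pass that remembers the current group name. The line layout used here (a marker line such as "list ARO" followed by the group's data lines) is an assumption about data.txt, and the class name MarkedGroups is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class MarkedGroups {
    // Starts a new group at every line beginning with `marker` and prefixes
    // each following data line with the current group name.
    static List<String> flatten(List<String> lines, String marker) {
        List<String> records = new ArrayList<>();
        String group = null;
        for (String line : lines) {
            if (line.startsWith(marker)) {
                group = line.substring(marker.length()).trim(); // text after the marker
            } else if (group != null && !line.trim().isEmpty()) {
                records.add(group + "\t" + line.trim());
            }
        }
        return records;
    }
}
```

MarkedGroups.flatten(lines, "list") returns one tab-separated record per data line. Lines that appear before the first marker are ignored.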

About datathinker

a technical consultant on Database performance optimization, Database storage expansion, Off-database computation. personal blog at: datakeywrod, website: raqsoft