Java provides functions for basic file processing, that is, for reading small text files in a simple, unstructured way. But when a file needs structured handling, holds data in various formats, has particular requirements, or is too big to be loaded into memory in one go, the Java code becomes complicated, and its readability and reusability are hard to guarantee.
esProc (a free edition is available) can make up for these deficiencies. It encapsulates many functions for reading, writing and processing structured data, and it provides a JDBC interface. A Java application calls an esProc script in the same way it calls a database stored procedure: it passes parameters to the script and gets the result set back through JDBC. You can learn more from How to Use esProc as the Class Library for Java.
The following cases are ones you frequently encounter when retrieving text files in Java, together with their esProc solutions.
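As a rough illustration of the stored-procedure style call described above, here is a minimal Java sketch. The driver class name and URL are assumptions based on esProc's JDBC documentation rather than on this article, and the script name readOrders and its parameter are purely hypothetical.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class EsprocCall {
    // Assumed values (from esProc's JDBC documentation, not from this article).
    static final String DRIVER = "com.esproc.jdbc.InternalDriver";
    static final String URL = "jdbc:esproc:local://";

    /** Builds the stored-procedure style call text, e.g. "call readOrders(?,?)". */
    static String callText(String scriptName, int paramCount) {
        return "call " + scriptName + "("
                + String.join(",", Collections.nCopies(paramCount, "?")) + ")";
    }

    /** Runs the script and copies every row of its result set into a list. */
    static List<Object[]> run(Connection con, String scriptName, Object... params)
            throws SQLException {
        List<Object[]> rows = new ArrayList<>();
        try (CallableStatement st = con.prepareCall(callText(scriptName, params.length))) {
            for (int i = 0; i < params.length; i++) {
                st.setObject(i + 1, params[i]); // JDBC parameters are 1-based
            }
            if (st.execute()) { // true when the first result is a ResultSet
                try (ResultSet rs = st.getResultSet()) {
                    int n = rs.getMetaData().getColumnCount();
                    while (rs.next()) {
                        Object[] row = new Object[n];
                        for (int c = 0; c < n; c++) {
                            row[c] = rs.getObject(c + 1);
                        }
                        rows.add(row);
                    }
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        Class.forName(DRIVER); // requires the esProc JDBC jars on the classpath
        try (Connection con = DriverManager.getConnection(URL)) {
            // "readOrders" and its argument are hypothetical.
            for (Object[] row : run(con, "readOrders", 2010)) {
                System.out.println(Arrays.toString(row));
            }
        }
    }
}
```

The Java side stays the same whatever the script does; only the script name and parameters change.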
Retrieving specified fields
You need to import the OrderID, Client and Amount columns by name. Below is the source data:
1. @t means importing the first row as column names. If there are no column names, you can reference columns by their sequence numbers. To import the first, second and fourth columns, for example, use file("D:\\sOrder.txt").import(#1,#2,#4). The result is as follows:
2. You can also export a computed column. For example, use the following code to combine the year and OrderID into a newOrderID and export it along with Client and Amount:
By default, the import function reads in all fields. The new function creates a two-dimensional table. The result is:
- The default separator is the tab character, but you can specify another one. To import a comma-separated CSV file, for example, use file("D:\\sOrder.txt").import@t(;",").
- To export some of the rows, specify them by row numbers. For example, use A1.to(2,100) to export rows from the second to the hundredth; and use A1.to(3,) to export rows beginning from the third one to the end.
- In a few cases, several columns need to be retrieved and exported as a single column. For example, to concatenate OrderID, Client and Amount and export them as one column, you can use the following code after importing the data:
Retrieving big files
To retrieve a big file that exceeds the memory capacity, use an esProc cursor, which a Java application can access as a JDBC stream.
- To speed up file retrieval, you can use multithreaded parallel processing through the @m option. The code is =file("D:\\sOrder.txt").cursor@tm(OrderID,Client,Amount). Note that this approach cannot guarantee that data is retrieved in its original order.
- Sometimes you need to segment data manually before processing it in parallel. To read in one segment of data, use file("D:\\sOrder.txt").import@z@t(;,2:24). @z means dividing the file roughly into 24 segments by bytes and importing only the second one. esProc automatically skips the incomplete row at the head of the segment and completes the row at its tail, so that every row is retrieved whole. If the data in a single segment still exceeds the memory capacity, replace the import function with the cursor function to return the data as a cursor.
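The head and tail handling described above can be sketched in plain Java for comparison. This is only an illustration of the general technique under stated assumptions, not esProc's actual implementation: the class and method names are hypothetical, rows are assumed to end with a newline character, the column-name row is not treated specially, and the reads are unbuffered for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SegmentReader {
    /** Returns the complete rows whose first byte lies in segment k (1-based) of n. */
    static List<String> readSegment(String path, int k, int n) throws IOException {
        List<String> rows = new ArrayList<>();
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            long len = f.length();
            long start = len * (k - 1) / n; // first byte of the segment
            long end = len * k / n;         // first byte after the segment
            if (start > 0) {
                // Step back one byte. If it is '\n', a row begins exactly at start;
                // otherwise this loop skips the partial row owned by the previous segment.
                f.seek(start - 1);
                int b;
                while ((b = f.read()) != -1 && b != '\n') {
                    // keep skipping
                }
            }
            // Read every row that starts inside the segment, even if it ends past `end`.
            while (f.getFilePointer() < end) {
                String row = readRow(f);
                if (row == null) {
                    break;
                }
                rows.add(row);
            }
        }
        return rows;
    }

    /** Reads bytes up to the next '\n' (or end of file) and decodes them as UTF-8. */
    private static String readRow(RandomAccessFile f) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = f.read()) != -1 && b != '\n') {
            if (b != '\r') {
                buf.write(b);
            }
        }
        if (b == -1 && buf.size() == 0) {
            return null; // nothing left to read
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SegmentReader <file>  (prints the second of 24 segments)
        readSegment(args[0], 2, 24).forEach(System.out::println);
    }
}
```

Because a row belongs to the segment in which it starts, every row is read by exactly one segment, which is what makes the segments safe to process in parallel.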
Retrieving file by column lengths
The following data.txt file does not use a separator:
You need to retrieve the file into a four-column two-dimensional table according to specified column lengths and return it to Java. The id column takes the first three characters, the flag column takes the 10th and 11th characters, the d1 column takes the characters from the 14th to the 24th, and the d2 column takes the characters from the 25th to the 33rd. Thus the four columns in the first row will be 001, DT, 100000000000 and 3210XXXX.
@i means returning a sequence (set) if the file has only one column. A two-dimensional table is then created based on A1; the mid function extracts a substring and ~ represents the current row.
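For comparison, the same fixed-position extraction can be sketched in plain Java. The character ranges are the ones quoted in the text; the class name and the sample line in main are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FixedWidth {
    // 1-based, inclusive character ranges from the text: id, flag, d1, d2.
    static final int[][] RANGES = { {1, 3}, {10, 11}, {14, 24}, {25, 33} };

    /** Cuts one line into its four fields; short lines yield shorter or empty fields. */
    static String[] parseLine(String line) {
        String[] fields = new String[RANGES.length];
        for (int i = 0; i < RANGES.length; i++) {
            // Convert to Java's 0-based, end-exclusive indexes and clamp to the line length.
            int from = Math.min(RANGES[i][0] - 1, line.length());
            int to = Math.min(RANGES[i][1], line.length());
            fields[i] = line.substring(from, to);
        }
        return fields;
    }

    /** Parses every line into a row of the two-dimensional table. */
    static List<String[]> parse(List<String> lines) {
        List<String[]> table = new ArrayList<>();
        for (String line : lines) {
            table.add(parseLine(line));
        }
        return table;
    }

    public static void main(String[] args) {
        // Hypothetical 33-character line, not taken from data.txt.
        String sample = "ABC......XY..12345678901abcdefghi";
        System.out.println(Arrays.toString(parseLine(sample)));
    }
}
```

Keeping the ranges in one table makes the layout easy to change, but the index arithmetic still has to be written and checked by hand.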
Here’s the result of the esProc script:
Retrieving file containing special characters
The data.csv file contains quotation marks, some of which interfere with the use of the data. So you need to remove the quotation marks before returning the file to Java. Below is the source data:
Here’s the result:
Retrieving the file containing mathematical formulas
In this case, you need to parse the mathematical formulas into expressions, evaluate them and return the results. Below is the source data:
The eval function dynamically parses a string into an expression and executes it.
The result is:
Retrieving file with multi-line records
In the following file, each record includes three lines. For example, the first record is JFS 3 468.0 2009-08-13 39. Now you need to export the file into a two-dimensional table.
First import the file as a sequence; @s means not splitting the line into fields. Then group the sequence every three members; “#” represents the row number and “\” means integer division. Here’s the result:
If the file is too big to be loaded entirely into memory, you need to use a cursor to retrieve it and process it in batches. First create sub.dfx, which responds to each external data request by retrieving a batch of data and returning it. This operation repeats until the whole file has been retrieved. Below is the esProc code:
The script loops through A1 and retrieves 3,000 rows in each loop. Each batch is then processed with the same algorithm as above. B4 returns B3 to the main script. The main script (the dfx file to be called by Java) is as follows:
The pcursor function requests data through sub.dfx, converts the returned data to a cursor and exports it.
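The batch-wise grouping of every three lines into one record can also be sketched in plain Java for comparison. This is an illustration of the idea only, not a translation of the esProc scripts: the class name, the single-space join and the batch size are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class MultiLineRecords {
    static final int LINES_PER_RECORD = 3;

    /**
     * Pulls up to maxRecords records from the line iterator.
     * Each record is three consecutive lines joined with a single space.
     */
    static List<String> nextBatch(Iterator<String> lines, int maxRecords) {
        List<String> batch = new ArrayList<>();
        List<String> parts = new ArrayList<>(LINES_PER_RECORD);
        while (batch.size() < maxRecords && lines.hasNext()) {
            parts.add(lines.next().trim());
            if (parts.size() == LINES_PER_RECORD) { // record complete
                batch.add(String.join(" ", parts));
                parts.clear();
            }
        }
        if (!parts.isEmpty()) {
            // The input ended in the middle of a record; keep what is there.
            batch.add(String.join(" ", parts));
        }
        return batch;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java MultiLineRecords <file>
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            Iterator<String> lines = reader.lines().iterator();
            List<String> batch;
            // 1,000 records per batch corresponds to 3,000 lines per batch.
            while (!(batch = nextBatch(lines, 1000)).isEmpty()) {
                batch.forEach(System.out::println);
            }
        }
    }
}
```

Only one batch is held in memory at a time, so the file size is limited by disk rather than by heap.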
Retrieving records from uncertain lines
In the data.txt file, the field values of a record are scattered over an uncertain number of lines, but the fields themselves are fixed. They are “Object Type”, “left”, “top” and “Line Color”, and they repeat until the end of the file. The first record, for example, is Symbol1, 14, 11 and RGB( 1 0 0 ). Now you need to retrieve the file into a structured two-dimensional table.
The read function reads in the file as one big string. The string is then split by the separator and the first empty line is removed. Finally, a table sequence is created and string functions (array, pos, len, mid) are used to find the desired fields. Note that you should use an if statement to check for the last line, because it may not end with a carriage return. Here’s the final result:
Besides the string functions, you can use regular expressions to find the desired fields.
To handle a big file that cannot be loaded into memory in one go, use pcursor function to retrieve it in batches.
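To make the regular-expression alternative concrete, here is a small plain-Java sketch. The exact layout of data.txt is not shown in the text, so the "name: value" line format used here is an assumption, as are the class name and the sample text; only the four field names and the first record's values come from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScatteredFields {
    static final String[] FIELDS = { "Object Type", "left", "top", "Line Color" };

    /** Assumed line format: "<field name>: <value>", with unrelated lines in between. */
    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        // Start a new record at every line that begins with "Object Type".
        for (String chunk : text.split("(?m)^(?=Object Type)")) {
            if (!chunk.startsWith("Object Type")) {
                continue; // anything before the first record
            }
            String[] row = new String[FIELDS.length];
            for (int i = 0; i < FIELDS.length; i++) {
                Pattern p = Pattern.compile(
                        "(?m)^" + Pattern.quote(FIELDS[i]) + "[ \\t]*:[ \\t]*(.*?)[ \\t]*$");
                Matcher m = p.matcher(chunk);
                row[i] = m.find() ? m.group(1) : null; // null when the field is missing
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample in the assumed format; note the missing final line break.
        String text = "\nObject Type: Symbol1\nsome other line\nleft: 14\ntop: 11\n"
                + "Line Color: RGB( 1 0 0 )\nObject Type: Symbol2\nleft: 2\ntop: 3\n"
                + "Line Color: RGB( 0 0 1 )";
        for (String[] row : parse(text)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because the pattern anchors on the end of a line or the end of the input, the last line needs no special case when the final line break is missing.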
Retrieving records by marked groups
The data.txt file stores records in groups. Each group name (such as ARO, BDR and BSF) is marked by a line headed by list. You need to combine the group names with their field values to form records and export them. Below is the source data:
First import the file into a sequence of strings, and then group the sequence according to the lines headed by list. @i starts a new group whenever the specified condition is true, so each line headed by list begins a new group. The sign * represents the wildcard character. Here’s A2’s result:
Then retrieve the desired fields and concatenate the records from each group. Here’s the result: