How to Use esProc to Assist Reporting Tools

Both SQL and Birt scripts can handle the group operation. In SQL, usually we can only group a table automatically according to its own filed(s). When the grouping criterion comes from another table, or is an external parameter or a conditional list, SQL has to handle the grouping in a very roundabout way. Some cases even require the dynamic criteria, which needs to be implemented by writing complicated report scripts. For some other cases, where the grouping criterion doesn’t entirely correspond to the source table (or the intervals involved don’t have any intersection), either the grouping result should be complemented or the difference should be excluded from it. Both scenarios are difficult to deal with in SQL or with the report script.

With support of alignment grouping, enumeration grouping, and order-related computations, esProc can easily implement those scenarios of non-equi-grouping. That’s why it is a best choice for preparing data needed for report creation. Birt/Jasper ,the reporting tool regards an esProc script as the stored procedure, passes parameter to it and gets the returned result after execution through JDBC. You can learn more from How to Use esProc to Assist Reporting Tools.

Here are some typical problems involving non-equi-grouping in SQL development and their solutions in esProc.

Simple fixed grouping

The sales table stores ordering records, in which CLIENT column holds the client names and AMOUNT column has ordering amounts. For the report, we need to group the table according to the list of “potential clients” and to aggregate the AMOUNT column in each group. The list is an external parameter and contains some items that the CLIENT column doesn’t (Their aggregation results should be zero). Now suppose the list is TAS,BON,CHO,ZTOZ . Below is a selection of the source data:

OrderID Client SellerId Amount OrderDate
26 TAS 1 2142.4 2009/8/5
33 DSGC 1 613.2 2009/8/14
84 GC 1 88.5 2009/10/16
133 HU 1 1419.8 2010/12/12
32 JFS 3 468 2009/8/13
39 NR 3 3016 2010/8/21
43 KT 3 2169 2009/8/27

When the grouping criterion is fixed and contains relatively few items, you can have union or decode work with SQL to perform the group. As with this example, the grouping criterion is a dynamic external parameter. We have to create a temporary table, parse each value of the parameter and insert the results into the temporary table before moving on to the subsequent computations. But with esProc, we don’t need to create a temporary table. Here’s the esProc code:

2016-03-09_105320

The align function groups records by arg1, the external parameter list, and uses @a option to get all records from each group; without @a, the function will get the first record of each group. The pjoin function composes each record in order, as shown below:

2016-03-09_105331

Dynamic aggregate by intervals

Suppose we want to segment the sales table according to the ordering amounts and aggregate the amounts for each segment in the report. The criterion of segmentation is a list parameter, which includes multiple items, such as four intervals separated by 0-1000, 1000-2000 and 2000-4000.

With a fixed criterion, we can write it directly in the SQL statement. But with a dynamic external parameter, we often need to compose the SQL statement using a report script, a high-level language such as JAVA. The process could be very complicated. esProc, however, supports the dynamic expression to produce smart code:

2016-03-09_105353

byFac is a parameter, like [“?<=1000″ ,”?>1000 && ?<=2000″,”?>2000 && ?<=4000″,”?>4000″]. The enum function groups records according to the criterion. Here’s the result:

2016-03-09_105405

It happens that the above conditional intervals don’t overlap each other. But in practice it’s not uncommon to have overlapped conditions. For instance, suppose we want to group the ordering amounts according to the following rules:

1000 ~ 4000: Regular orders

Below 2000: Common orders

Above 3000: Important orders

Among these conditions, r1 and r2 overlap each other. Sometimes we don’t want duplicate records in the grouping result (that is, after finding records satisfying r1, then find records satisfying r2 from the rest of the records). Other times we do want the duplicate records (find records satisfying each condition from the whole table). For both scenarios, SQL needs a great amount of with, union, except or minus statements to produce extremely lengthy code. By default, the esProc enum function performs groups without duplicates; but it also allows them by using @r option.

Dynamic accumulation by intervals

The performance table holds the performance scores and performance bonuses of employees. Starting from zero, we divide the performance scores every 10 scores and then, from low to high, aggregate the bonuses cumulatively section by section. That is, the accumulated amount of the current section should cover the bonuses of all previous sections. Below is the source data:

id score bonus
e01 9 800
e02 21 2300
e03 25 2800
e04 33 4100
e05 46 5800
e06 52 6099

Since the performance scores are not fixed, the number of intervals is unfixed too. Oracle and MSSQL are able to implement this by creating an auxiliary interval list and using the window functions, but they produce lengthy code. It’s even harder for databases that don’t support window functions, such as MySQL. esProc is a simpler alternative to deal with this type of non-equi-grouping cases. Below is the esProc code:

2016-03-09_105417

A2 generates intervals dynamically, in which m gets members by their sequence numbers, m(-1) gets the last member and “\” gets the integer part of the quotient. A3 creates a two-dimensional table based on A2 and accumulates bonuses by querying corresponding records in A1. Here’s the result:

2016-03-09_105426

Complementing the source table for fixed grouping

The building table contains records of project completion, in which year is the string type completion time in the format of “year the first half year\the second half year”. For the report, we need to count the number of buildings that have been completed for each type of project during every half year over a period specified by beginning and ending years. Below is the source data:

id type year
1 33 2014 last half
2 33 2014 last half
3 33 2013 first half
4 34 2013 first half

The years in the source table are inconsecutive. It’s not easy for SQL to fill in the missing years and then do the left join. We can use esProc to make the job simpler:

2016-03-09_105436

A2 creates a new two-dimensional table. A3 creates a year list according to the beginning and ending parameters (argb, arge). A4 groups the building table by type and processes each group by loop. Each loop (B4) will insert into A2 a number of records that are the same with the number of items in the year list. Here’s A2’s final result:

2016-03-09_105444

Complement to the source table for fixed grouping and transposition

The onBusiness table stores the records of the employees’ business trips, in which Date and id_user are the main fields. The user table holds the user information, whose main fields are id and name. We want to show in the report whether each employee has a business trip plan for each week in order, with a special requirement that each employee has a column. Below is a selection from the onBusiness table:

Date id_user
2015-06-22 2
2015-06-01 1
2015-06-03 1
2015-06-19 1
2015-06-02 2

Suppose the beginning date and ending date are 2015-05-31 and 2015-06-28, the desired report layout would be like this:

week user1 user2
1 yes yes
2 No No
3 yes No
4 No yes

esProc code:

2016-03-09_105520

First query the desired data using a simple join statement. Next create a two-dimensional table A3 by the intervals, where each week has a row (automatically filling in missing values when dates are interrupted) and each employee has a column. The initial values are “No”. Then loop through A2’s groups to modify corresponding values in A3 to “Yes”.

Inter-group calculation by months

The work table stores information for a job. In the table, People field holds names of workers, Date field has the entry dates and Deleted field holds the quitting dates. We need to summarize from March to July how many people are on the job each month in the report. Below is the source data:

People Date Deleted
Amanda 2015-03-01 Null
Ray 2015-03-01 Null
Moe 2015-04-01 Null
Yan 2015-05-01 Null
Bee 2015-05-05 2015-06-12
Lee 2015-06-06 Null
Jason 2015-05-01 2015-07-03

The problem here is that the months in the source table may be discontinuous, yet we want a consecutive result. So we need to first generate the consecutive months and then perform the alignment grouping on the source table according to the consecutive months. A worker who quits the job will still be included in the on-the-job list of the current month, but will be excluded from that of the next month. To obtain the right number of the workers who are on the job in each month, we need to perform the inter-column calculation and inter-group accumulation.

esProc code:

2016-03-09_105604

The to function generates a consecutive sequence. The new function creates a two-dimensional table based on a sequence (or another two-dimensional table). ~ represents the current member in the original sequence. Here’s the final result:

2016-03-09_105614

Dynamic locating of an interval

The Transaction table records the time points of each transaction for each user. The Discount table holds the discount information after certain time points. There are multiple discount records, forming dynamic time intervals. Now we need to calculate the discount of each transaction for each user for reporting.

Below is a selection of the Transaction table:

TransID Tuser Date
t1 Andrew 2015-06-16 13:13:00
t2 Andrew 2015-06-16 13:15:00
t3 Andrew 2015-06-16 13:17:00
t4 Andrew 2015-06-16 14:15:00
t5 Andrew 2015-06-16 14:18:00
t6 Andrew 2015-06-16 14:25:00
t7 Andrew 2015-06-16 14:35:00
t8 Andrew 2015-06-16 14:55:00
t9 tylor 2015-06-16 13:13:00
t10 tylor 2015-06-16 14:15:00
t11 tylor 2015-06-16 14:55:00

Below is a selection of the Discount table:

DiscountID Date Discount
d1 2015-06-16 13:00:00 30
d2 2015-06-16 14:00:00 25
d3 2015-06-16 14:30:00 20

In SQL, we need to first generate the time intervals through inter-row calculation and then perform the join, which is difficult to implement. Whereas in esProc, we can perform the equi-grouping on Transaction according to Discount, meaning the code is simpler:

2016-03-09_105715

The pseq function calculates the interval number to which a record belongs, and A2(…) gets records by their sequence numbers. Here’s the result:

2016-03-09_105725

In this example all users share the same discount. If each user has its own discount (there is a DUser field in the Discount table), the code should be like this:

2016-03-09_105756

Here’s the result:

2016-03-09_105805

Posted in Reporting tool | Leave a comment

A Standard Method of Doing Inter-row Calculations in Report Creation

Not all reporting tools support inter-row calculations directly. When they don’t you should write scripts to perform them, which is a big hassle, even bigger when multilayer data grouping is involved. esProc can do the calculations with simple code thanks to its support for order-related calculations. So it’s the best choice to prepare the data needed for creating a report in esProc. An esProc script can be identified as a database stored procedure by the reporting tool for execution, accept parameters from it and return result to it through JDBC. See How to Use esProc to Assist Reporting Tools to learn more.

The following examples show inter-row calculations that appear frequently in report development and their esProc solutions.

Link relative ratio and year-on-year comparison

The database table sOrder stores orders per seller per day. The report needs to present results of comparing the sales amount per month during a specified time period with that of the previous month and with that of the same month from the previous year. Below is the source data:

esProc_report_cross_rows_1

Script for calculating link relative ratio:

esProc_report_cross_rows_2

First perform the SQL group and aggregate to get sales amount per month per year, and then calculate link relative ratio using the formula “sales amount of the current month/sales amount of the previous month”. Here’s the result:

esProc_report_cross_rows_3

Script for comparing sales amount between same months on an annual basis:

esProc_report_cross_rows_4

Perform a SQL sort on the grouped data by months and years, and then calculate the rate with the formula “sales amount of the current month/sales amount of the same month from the previous year. Here’s the result:

esProc_report_cross_rows_5

Monthly cumulative sales amount

Calculate the cumulative sales amount per seller per month. The cumulative amount will be cleared at the end of the year.

esProc script:

esProc_report_cross_rows_6

First perform a SQL group and aggregate to calculate the sales amount per seller per month in one year, and then perform the inter-row calculation with the formula “accumulated sales amount of the current month = sales amount of the current month + the accumulated sales amount of the previous month”. Here’s the result:

esProc_report_cross_rows_7

Balance of every term for one account

data.csv stores information of deposit into and withdrawal from a bank account whose initial balance is 45. Now you need to calculate the balance of every term according to this file. Below is the source data:

esProc_report_cross_rows_8

esProc script:

esProc_report_cross_rows_9

Import the file with comma used as the separator, and calculate the current balance using the formula “previous balance + current deposit – current withdrawal”. Here’s the result:

esProc_report_cross_rows_10

Daily inventory of multiple products

The database table inout stores warehouse-in and -out records per product per day during continuous days and the initial inventory is zero. You need to display the data in a report and add an extra column to represent the stock per day.

esProc_report_cross_rows_11

esProc script:

esProc_report_cross_rows_12

Calculate current day’s stock for each kind of product (through data grouping) with the formula “current day’s warehouse-in – current day’s warehouse-out + warehouse stock of the previous day”. esProc uses [-1] to represent the previous record and produces intuitive expression. Here is the result:

esProc_report_cross_rows_13

For discontinuous days, you need to align data with a sequence of consecutive dates before calculating the inventory. The script is as follows:

esProc_report_cross_rows_14

Installment loan

The loan table stores loan information, including total loan amount, payment terms by month and annual interest rate. You need to build a grouped report in which details of each term of payment for every loan record – payment, interest, principal, principal balance – will be presented. Below is the source data:

esProc_report_cross_rows_15

esProc script:

esProc_report_cross_rows_16

First calculate the monthly interest rate and each term’s payment for every loan using a formula; then create a two-dimensional detail table for each loan according to the available information and calculate the current term’s interest, principal and principal balance; finally concatenate these detail tables. Below is the detail table for loan 1:

esProc_report_cross_rows_17

Calculating inter-row ratios irregularly

The database table majorSum stores the number of patients in a hospital’s every key department and the total patient number in the hospital. You need to present the ratio of the patient number in each department to that in the whole hospital, and keep the total patient number of the hospital at the end. Below is the source data:

esProc_report_cross_rows_18

The desired layout:

department patient
Accident and emergency (A&E) 7.03%
Anaesthetics 12.37%
Breast screening 12.37%
Discharge lounge 14.61%
Ear nose and throat 15.06%
Haematology 17.43%
Neurology 5.23%
Cardinal Community Hospital 44562

esProc script:

esProc_report_cross_rows_19

First find the record of total patient number of the hospital (which has the largest patient number) using maxp function; use “\” to perform a difference operation to get the records of patient numbers in departments and calculate the required ratio; and finally use “|” to concatenate the newly generated records. Here’s the result:

esProc_report_cross_rows_20

Posted in Application, Reporting tool | Leave a comment

Report Building Challenges: In-depth Processing of JSON Data

For processing semi-structured JSON data, both Java and the reporting tools provide class libraries that are only capable of simple parsing jobs, giving you no choice but to write complicated code when trying to process the data in-depthly. However, using esProc to prepare the data needed for report building can help you out. The code is quite simple. A reporting tool will execute an esProc script in the same way as it executes a database stored procedure, pass parameters to the script and get the returned esProc result through JDBC. See How to use esProc to Assist Reporting Tools for details.

Below are examples of a range of most common problems you may encounter in presenting JSON data with reporting tools, and their esProc solutions.

Grouping and aggregating JSON data

order.json stores order records. You need to calculate the total sales amount a client contributes per month in a specified time period. Below is a selection of the source data:

esProc_report_depth_jason_1

esProc script:

esProc_report_depth_jason_2

Import the JSON file as a two-dimensional table; perform a conditional query and then the group and aggregate. Both argBegin and argEnd are report parameters. Here’s the result:

esProc_report_depth_jason_3

Presenting cell phone logs in a list table

cellphone.log is a file storing cell phone logs in JSON format. In it, general and ping fields are subdocuments containing one record each, and cap_main field, a subdocument too, contains multiple records. You need to present cap_main field in a list table, with each record being appended by general’s mtmodel and networktype fields and ping’s ssid field.

Below is the source data:

esProc_report_depth_jason_4

esProc script:

esProc_report_depth_jason_5

After esProc imports the JSON file, it will retain the file’s multilayer structure. At the same time, it can use periods to access fields in different layers of data. The derive function will add computed columns to the table. Here’s the result:

esProc_report_depth_jason_6

Querying the multi-level subdocument in a JSON file

jsonstr.json has a subdocument, runners field, which has three fields – horseId, ownerColours and trainer which contains a subfield – trainerId. The report needs to present the horseId, ownerColours and trainerId fields in each subdocument filed according to serial numbers of the documents.

Below is the source data:

esProc_report_depth_jason_7

esProc script:

esProc_report_depth_jason_8

Read in the JSON file; retrieve runners field (the subdocument) according to the serial numbers of documents; and get the desired fields, in which trainerId comes from a subdocument of runners. A3 is the data the report needs:

esProc_report_depth_jason_9

The grouped report from a multi-level JSON file

Cells.json is a multi-level nested JSON file, which you want to present in a grouped report. The grouping fields are name, type and image.”xlink:href”. There is also a field with 3 subdocuments: custom. Identifier, custom. Classifier and custom. Output, which are of the same structure but contain different number of documents each.

Below is the source data:

esProc_report_depth_jason_10

esProc script:

esProc_report_depth_jason_11

esProc merges the three subdocuments into a single two-dimensional table, gives them a new field name ctype to be identified and appends the grouping fields to the tabled. Thus a typical “table with subtables” is created.

esProc_report_depth_jason_12

Now it’s easy to build a grouped report based on this esProc result.

A report with subreports using different JSON files

You might want to create a report containing multiple subreports, where the main report and the subreports use different JSON files as their sources. Below is a selection of the source data

esProc_report_depth_jason_13

A reporting tool supporting only a single data source, such as Jasper and BIRT, would need to combine the multiple sources into one using JAVA classes, while esProc would use a simple script as follows:

esProc_report_depth_jason_14

Read in the JSON file and get the first field, which is represented by “.#1”. By assigning different file names to the parameter argFileName, the report will receive different data sets, as the following shows:

esProc_report_depth_jason_15

Joining a txt file and a JSON file

structure.txt is a structured text file separated by tabs. json.txt contains unstructured JSON strings. There is a foreign key relationship between the second field of structure.txt and part of the json.txt. Below are selections from them:

structure.txt

esProc_report_depth_jason_16

Json.txt

esProc_report_depth_jason_17

You need to build a report to present the above relationship, with a layout like this:

esProc_report_depth_jason_18

esProc script:

esProc_report_depth_jason_19

Read in the JSON file, retrieve the desired fields and subdocument to create a two-dimensional table and append a computed column to it. Below is A3’s result:

esProc_report_depth_jason_20

Then import the text file, perform a join between it and A3 and retrieve the desired fields to generate the final result A6, as shown below:

esProc_report_depth_jason_21

Posted in Reporting tool | Tagged , , | Leave a comment

Performing Group Operations on Text-based Tabular Data in JAVA

The group operations performed on tabular data generated from text files include algorithms like grouping and aggregation, obtaining distinct values, group merging and so on, which can be realized using basic JAVA class libraries. But JAVA provides only limited support for the structured-data computing, generating complicated and inefficient code for knotty cases.

Yet it’s easy to perform group operations on tables imported from text files by using esProc as the JAVA class library for handling structured data. The ways esProc works include independent operation and receiving invocation from either the command line or a JAVA application. More information can be found in the article How esProc Implements Text Processing.

Here follows some typical scenarios about group operations operated on text-based tabular data and offers their solutions in esProc.

Simple grouping and aggregation

The sales.csv file stores sales orders. Requirement: Group data by SellerId to calculate the total sales amount per seller and write the results to a new file. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

esProc code:

A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.groups(SellerId;sum(Amount))
3 =file(“e:\\result.txt”).export@t(A2)

Explanation: First import the source file as a structured two-dimensional table; then perform the group and aggregate.

By default The import function imports all fields using tab as the separator; @t means importing the first row as the field names. The groups function can perform group and aggregate on a two-dimensional table. The export function writes the two-dimensional table to a file. To write the tabular data to another data table having the same structure, use the code OracleDB.update(A2,Tsales).

Here’s the result:

esProc_at03a_1

Extended information:

  1. If the source data has been ordered by SellerId, we could use @o to speed up the data handling by only comparing the adjacent records with the code =A1.groups@o(SellerId;sum(Amount)).
  2. esProc allows for grouping data according to multiple fields and the implementation of various aggregate algorithms. To calculate the sales amount and count the orders for each seller per year, for example, we could use this code:groups(SellerId,year(OrderDate);sum(Amount),count(OrderID)).

Here’s the result:

esProc_at03a_2

  1. When the source file is too big to be entirely loaded into the memory but the aggregate result would be smaller (this is what usually happens), we can import the file as the cursor and perform the aggregate with the group function, thus the code will be file(“D:\\sales.txt”).cursor@t().groups(SellerId;sum(Amount)).
  2. If the aggregate result set isn’t small enough to be completely put into the memory, we could carry out the aggregation using groupx Unlike the group function, groupx returns a cursor.

Duplicate-related operations

This scenario is about removing the duplicate rows in the Sales.csv file, that is, for two rows with the same OrderID, only keep the first one. Below is a selection of the source file:

OrderID   Client        SellerId     Amount    OrderDate

1       WVF Vip 5       440.0        2009-02-05

1       WVF Vip 5       1040.0      2009-02-04

1       UFS Com          5       1040.0      2009-02-03

2       UFS Com  13     1863.4      2009-07-05

2       UFS Com  13     1863.4      2009-07-05

3       SWFR        2       1813.0      2009-07-08

4       JFS Pep     27     670.8        2009-07-08

5     DSG       15    3730.0    2009-07-09……

esProc code:

A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.group@1(OrderID)

Explanation: Group the imported data by OrderID and get the first record of each group without aggregation. The group function is able to group data without performing the aggregate; @1 means getting the first row of each group and returning the results as a new two-dimensional table.

Here’s the result:

esProc_at03a_3

Extended information:

  1. It’s a more practical practice to retain records according to a certain condition. To keep the record with the earliest date in each group, use the following code: group(OrderID).(~.maxp(-OrderDate)).

Here’s the result:

esProc_at03a_4

To find the first of each group of records sorted by OrderDate in ascending order and by Amount in descending order, we can find the records with earliest OrderDate and then find the one from them with the greatest amount. The code is A1.group(OrderID).(~.maxp([-OrderDate,Amount])).

If there are multiple records satisfying the condition and we want them all, use A1.group(OrderID).conj(~.maxp@a([-OrderDate,Amount])) .

To get the top two from the multiple eligible records, use

A1.group(OrderID).conj(~.top(2,OrderDate,-Amount)).

  1. Sometimes it is the duplicate records that we want, then get them with group(OrderID).select(~.len()>1).conj().

Here’s the result:

esProc_at03a_5

  1. Sometimes we simply want the distinct values of a certain field. To get the client list, for instance, use id(Client). The result:
  2. esProc_at03a_6

The id function performs the distinct operation, so it is the equivalent of A1.group@1(Client).(Client).

  1. Other times we want all the rows that have different field values, instead of filtering the records by ID. In those cases we do it with

A1.group@1(OrderID,Client,SellerId,Amount,OrderDate)

The result:

esProc_at03a_7

Without intending further structured-data computation, we can import the file as a set of rows and perform distinct operation using file(“D:sales.csv”).import@tsi().id(). This is what we get:

esProc_at03a_8

Concatenating strings by groups

The sales.csv file stores sales orders, based on which we want a table of client lists having three fields – year, month and comma-separated client list. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

esProc code:

A
1 =file(“D:\\sales.cvs”).import@t()
2 =A1.group(year(OrderDate),month(OrderDate);~.(Client).string())

Explanation: Group the imported data by the two parts of OrderDate, get the set of clients in each group and convert them to comma-separated strings. Here’s the result:

esProc_at03a_9

Intra-group alignment

The empContact.csv file has six columns, the first four of which contain employee information where duplicate values exist; the fifth column holds the ways of contact and the last column is the specific contact information. Each employee has at least one way for contact. Now we want to merge the records to make a new table where the fifth, sixth and seventh columns are work phone, cell phone and work email. Below is source file:

21458952,John,technology,support staff,work phone,555-555-5555

21458952,John,technology,support staff,work email,johndoe@whatever.net

21458952,John,technology,support staff,cell phone,233-1132-9762

99946133,Jane,technology,administration,work phone,444-444-4444

99946133,Jane,technology,administration,work email,janepaul@whatever.net

99946133,Jane,technology,administration,cell phone,121-3433-9788

99946133,Jane,technology,administration,home phone,431-443-8434

esProc code:

A
1 =file(“D:\\empContact.csv”).import(;”,”)
2 =A1.group(#1,#2,#3,#4;~.align([“work phone”,”cell phone”,”work email”],#5):g)
3 =A2.new(#1,#2,#3,#4,g(1).#6,g(2).#6,g(3).#6)

Explanation: Group the imported records by the first four columns and align records in each group according to the order of work phone\cell phone\work email (the duplicate values will be automatically filtered away); then build a new two-dimensional table by sequentially retrieving the records. Here’s the result:
esProc_at03a_10

Grouping by sequence numbers

In the following file, every three lines correspond to one record. For example, the first record is 26\TAS\2009-08-05. We want to rearrange it into a two-dimensional table. Here’s the source data:

26

TAS

2009-08-05

33

DSGC

2009-08-14

84

GC

2009-10-16

esProc code:

A
1 =file(“D:\\data.txt”).import@si()
2 =A1.group((#-1)\3)
3 =A2.new(~(1):OrderId,~(2):Client,~(3):OrderDate)

Explanation: Import the file as a sequence, during which @s means that fields won’t be extracted; group the sequence every three lines, during which “#” represents the sequence number of each line and “\” represents integer division; finally create a table sequence based on the groups, during which ~(1) is the first member of the current group.

The result:

esProc_at03a_11

Extended information: If the file has a complicated format. For example, there are three fields in the second line of each group:

26

TAS   1       2142.4

2009-08-05

33

DSGC        1       613.2

2009-08-14

84

GC    1       88.5

2009-10-16

In this case, we just need to modify A3 into this : =A2.new(~(1):OrderID, (line=~(2).array(“\t”))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate )

The result:

esProc_at03a_12

Intra-group order-related calculation

postTurn.txt records the schedule of rotating shifts for a post. The file is grouped by dates and some persons work consecutive shifts. We want to find out how many times the two neighboring names work successive shifts. The fields would be first, next and times. Below is the source file:

#2015-06-01

Ashley

Matthew

#2015-06-02

Ashley

Matthew

Matthew

#2015-06-03

Ashley

Ashley

Matthew

Ryan

#2015-06-04

Ashley

Ryan

Ryan

Matthew

#2015-06-05

Jessica

#2015-06-06

Ashley

Jessica

Matthew

Ashley

Jessica

Jessica

Ryan

esProc code:

A
1 =file(“D:\\postTurn.txt.txt”).import@i()
2 =A1.group@i(pos(~,”#”))
3 =A2.conj(~.([~[-1],~]).to(3,))
4 =A3.group(~).new(~(1)(1):first,~(1)(2):next,count(~):times)

Explanation: Group the imported data according to the condition that whether a row contains “#” or not, and each group is a day’s work shift schedule; for each group, put every two neighboring names into a subset and concatenate all the subsets; group each concatenated set every two members, get the first name and the next name from each group and count how many times they appear together. Here’s the result:

esProc_at03a_13

Inserting data between groups

In the sales.txt file, the value of SellerId changes every multiple rows. We want to export the OrderId field and add the string “Begin” every time a group of records with new value begins and the string “End” after a group ends. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

esProc code:

A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.group@o(SellerId).conj(“Begin”|~.(OrderID)|”End”)

Explanation: Write the code directly according to what it asks. @o enables grouping by comparing the adjacent records. To retrieve a certain field from a group of records, use ~.(OrderID). The operator “|” is used to concatenate sets or members of a set. The conj function handles subsets respectively and then concatenates the results.

Here’s the result:

esProc_at03a_14

Extended information:

  1. Stop using @o when data is unordered, but the performance will decrease a little.
  2. To add an aggregate value, the number of records in each group, for example, write A2 as group@o(SellerId).conj((“Begin: “+string(~.count()))|~.(OrderID)|”End”)

Here’s the result:

esProc_at03a_15

  1. To add strings to both ends of a group of records and to append them to a new file in order, use the following code:
A B
1 =file(“D:\\sales.csv”).import@t() =file(“d:\\result.txt”)
2 for A1.group@o(SellerId) =B2.export@a([“begin”])
3 =B2.export@a(A2)
4 =B2.export@a([“End”])

The result:

begin

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

End

begin

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

End

begin

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

End

begin

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

End

Sampling by groups

The sales.txt file stores ordering data, and we want to retrieve one record from each seller’s records. Here’s the rule: Find out records whose OrderDates are fewer than 60 days after their previous ones, and retrieve one randomly from these sampled records; if there is an insufficient number of records satisfying the condition, just get the record that has the smallest OrderDate. Besides, it is specified that the second record will always appear in each group. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

144  WZ   6       86.4 2010-12-23

70     DSG 7       288.0        2009-09-30

131  FOL  7       103.2        2009-12-10

65     YZ     8       29600.0   2009-01-06

esProc code:

A B C
1 =file(“D:\\sales.csv”).import@t()
2 for A1.group(SellerId) =A2.sort(OrderDate).select(interval(OrderDate[-1],OrderDate)<60)
3 if B2!=[] =B1=B1|B2(rand(B2.len())+1)
4 else =B1=B1|B2(1)

Explanation:

Group the imported records by SellerId and loop through the groups. For each loop, first sort the group by OrderDate and then take the sample as required. If there’re enough records sampled, retrieve one randomly and append it to B1; if the number of records meeting the condition is insufficient, then get the first record of this group. OrderDate[-1] represents the previous record relative to the current one. Below are the results of three sampling:

esProc_at03a_16

Data grouping, splitting and export

The sales.txt file stores a large number of sales orders, which we want to divide into multiple files according to both the year and the month. The format of the file name will be “year-month.txt”. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

esProc code:

A
1 =file(“D:\\ sales.txt”).import@t()
2 =A1.group(string(OrderDate,”yyyy-MM”);~)
3 =A2.run(file(“d:\\temp\\”+#1+”.txt”).export(#2))

Explanation: Group the imported file by the year and the month, parse the data in each group; loop through each group to write it to the file. For instance, the following is the contents of the 2009-01.txt file:

65     YZ     8       29600.0   2009-01-06

62     JAXE          11     8134.0      2009-01-06

64     HP    13     20000.0   2009-01-02

60     PWQ         16     3430.0      2009-01-05

63     SJCH         16     5880.0      2009-01-02

61     SJCH         19     1078.0      2009-01-08

Extended information: When the size of the source data exceeds the memory capacity, we should import the file with the cursor function. If each group of data is still too big to be accommodated by the memory, we can use the groupx function to group the cursor file. The coding structures for both cases are the same.

Handling a big file by groups

The sales.txt file is too big to be held by the memory, thus we need to filter out the records of the clients whose total order amount is greater than 10,000, and write them to another file. The source data is already ordered by SellerId, as shown below:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

esProc code:

A B C
1 =file(“D:\\sales.csv”).cursor@t()
2 for A1;SellerId if A2.sum(Amount)>=10000 =file(“e:\\result.txt”).export@at(A2)

Explanation: Loop through the cursor by SellerId and import a group of data into the memory at a time; sum up the amounts for the current group and write the records to the new file if the sum satisfies the condition. Here’s the result:

esProc_at03a_17

If the final result set is relatively small, we can append it to a cell (B1, for instance). In that case C2 should be B1=B1|A2.

If the source data isn’t ordered, we can first sort it using the sortx function, which returns a cursor, too.

Multilevel grouping

The Stock.txt file keeps the information of goods in and out of the warehouse. A same kind of goods may be found in and out several times a day, or there may be nothing in and out during continuous days. The initial inventory of the goods is 0. We use In to represent goods shipped in, and Out for those shipped out. Now we want to calculate the daily inventory for all kinds of goods. Below is the source file:

date name        quantity   flag

2014-04-01       Item1       15     In

2014-04-01       Item1       4       In

2014-04-02       Item1       3       In

2014-04-02       Item1       10     Out

2014-04-03       Item1       3       In

2014-04-04       Item1       5       Out

2014-04-07       Item1       4       In

2014-04-10       Item1       2       Out

2014-04-01       Item2       20     In

2014-04-02       Item3       30     In

2014-04-03       Item3       14     Out

esProc code:

A B
1 =file(“D:\\stock.txt”).cursor@t()
2 =A1.group(name,date;~.select(flag==”In”).sum(quantity):in,~.select(flag==”Out”).sum(quantity):out)
3 =A2.group(name) =periods((t=A2.id(date)).min(),t.max(),1)
4 for A3 =A4.align(B3,date)
5 >c=0
6 =B4.new(A4.name:name,
B3(#):date,
c:Opening,
in,
(b=c+in):Total,
out,
(c=b-out):Close)
7 =@|B5

Explanation: A2 calculates the daily in-and-out quantity for each kind of goods; obtain the complete list of dates according to the earliest and latest dates and store it in B3; group records by items, loop through each group to align the current group of records with B3 and calculate the daily inventory in order. Here’s the result:

esProc_at03a_18

 

Posted in Java Assistant | Leave a comment

esProc Simplifies SQL-style Computations – Dynamically Transposing Rows and Columns

In real-world business, many computing tasks require transposing rows and columns dynamically. There are a lot of discussions around the operation in online IT groups and forums. Below lists some of them:

http://www.dbforums.com/showthread.php?1628876-SQL-question-how-to-move-two-column-data-to-one-column-data

https://www.linkedin.com/grp/post/137774-5981739289539526658?trk=groups-post-b-title

http://stackoverflow.com/questions/29756836/mysql-pivot-table-and-numbered-columns

http://www.linkedin.com/groups/need-help-86080.S.5796995228699222017?trk=groups_items_see_more-0-b-ttl

SQL implements row-to-column transposition in these ways:

  1. Using functions for transposing rows and columns

Oracle 11g and above versions, as well as MSSQL2005+, provide row/column transposition operators – pivot and unpivot – for switching rows to columns and columns to rows. They require specifying specific targeted columns and thus cannot handle scenarios with dynamic columns straightforwardly.

  1. Using CASE expression

For some databases that don’t support pivot, like MySQL and DB2, you can use the conditional expression case when to handle the transposition. Similar to pivot, case when also requires the fixed targeted rows. You cannot code dynamically switching rows to columns in a straightforward way.

To deal with transposition to dynamic columns, the only option is:

  1. Composing dynamic SQL

To transpose rows to dynamic columns, you need to compose dynamic SQL statements in the stored procedure. The ways the coding is actually done and their levels of difficulty vary according to different databases. Therefore it is impossible to write universal SQL statements for the computation.

In real-world cases, the transposition from rows to columns is often accompanied by inter-column calculations. This further complicates the problem.

Usually the aim of transposing rows to columns is to further display data. That means a main program (like the reporting tool) will receive the transposing result and use it to proceed to the next operation. Use Raqsoft esProc (free version is available) to help to handle the transposition if the main program is in Java. esProc script is written for dynamic interpretation and execution, and, therefore, more universal for coding row/column transposition. esProc provides JDBC interface that acts as the middleware between Java application and the database to let Java application execute esProc script as it accesses a database, without changing the application structure.

A simple example will be used to illustrate how esProc handles the row-to-column transposition and integrates with Java main program.

1. Simple row-to-column transposition

Generally, a row-to-column transposition operation simply transposes rows of data to columns of data, without involving the complex inter-column calculations. For example, transposing the student score table into sets where data is displayed by subjects:

esProc_sql_dynamic_transpose_1

The targeted result:

esProc_sql_dynamic_transpose_2

esProc script for implementing the task:

esProc_sql_dynamic_transpose_3

A1: Execute SQL to retrieve data, and sort data by ID and SUBJECT.

A2-A3: Group data by ID and SUBJECT. esProc retains the grouping result – the subsets – for later use.

A4: Create a desired, dynamic empty result set.

A5-B5: Loop through A2’s student groups, and write student IDs, names and scores of subjects into the empty result set according to A3’s grouping result.

A6: Return the result set.

Basic steps for transposing rows to columns with esProc: Generate the empty targeted result set (A4); then compute and append every row of data to the result set (A5, B5). With the stepwise computing mechanism supporting data table object, you can code the transposition from rows to columns in a natural thinking pattern.

The result set of esProc script can be returned to Java main program or the reporting tool via JDBC interface. Below is the code for Java to call the esProc script:

Class.forName(“com.esproc.jdbc.InternalDriver”);

                    con= DriverManager.getConnection(“jdbc:esproc:local://”);

                    // Call esProc script (which is similar to the stored procedure); the script file name is p1.

                    st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call p1 ()”);

                        // Execute the script

                    st.execute();

                    // Get the result set

                    ResultSet rs = st.getResultSet();

                     ……

The returned value is a JDBC standard ResultSet object. The way of calling esProc script is the same as that of accessing a database. Programmers who are familiar with JDBC can master it fast.

About deploying esProc JDBC and calling esProc script in it, see esProc Integration & Application: Java Invocation.

2 Transposing rows to dynamic columns

In the preceding example, column names (values of SUBJECT) can be determined directly. In that case, it is not so difficult to code the transposition problem using static syntax like pivot (or case when). If they need calculations to be determined dynamically, it is hard to handle the problem with pivot. For example, manufacturing workshops manufacture different kinds and numbers of product.

esProc_sql_dynamic_transpose_4

You need to determine the number of resulting columns according to the one with the greatest length among groups divided by code. The targeted result:

esProc_sql_dynamic_transpose_5

esProc script for implementing the task:

esProc_sql_dynamic_transpose_6

A1: Execute SQL to retrieve data from the output table.

A2: Group data by code. esProc retains the grouping result (members in every group) for use in later computations.

A3: Calculate the maximum number of members among the groups, so as to determine the number of columns in the result set.

A4-A5: Create an empty, dynamic result set.

A6-B7: Loop through A2’s grouping results, and write product and amount of each group into A5’s result table sequence.

Similar to the preceding script, this script first generates an empty, dynamic result set, and then calculates desired data and appends it to the result set.

This computational task requires writing dynamic SQL statements to compose the result set. As the number of columns can only be determined by getting the group with the greatest members, and, unlike the simple pivot operation, field values cannot be directly used as the column names in composing a result set, the most effective way is writing stored procedure step by step.

Compared with the complicated stored procedure programming, esProc produces more concise code in a more easily way with step-by-step computing model.

3 Row-to-column transposition involving inter-column calculations

As mentioned at the beginning of the article, transposing rows and columns often involves inter-column calculations. Below is such an example:

esProc_sql_dynamic_transpose_7

Output the payment of each month in the specified year (say, 2014). If a month lacks the related data, the payment for this month is the same as that for the previous month.

The targeted result:

esProc_sql_dynamic_transpose_8

esProc script for implementing the task:

esProc_sql_dynamic_transpose_9

A1: Execute SQL to retrieve data of the specified year.

A2: Create an empty table sequence with 12 months for the result set.

A3: Group the data by customer name.

A4-B7: Loop through A3’s groups to calculate. B5 specifies the payment for the current month; B6 specifies null as the payment value of the previous month; and B7 inserts the resulting records into the empty table sequence.

Likewise, the implementation first creates an empty result set and then appends data to it. Difference is that the appended data is got through a series of calculations.

esProc supports order-related computing, so it is easy for it to reference the value of the previous record. Compared with complex SQL approach or stored procedure programming, esProc script is clearer and easier to understand in performing inter-column calculations for dynamic row-to-column transposition.

4 Column-to-row transposition

On top of the transposition scenarios mentioned in the above, there are others requiring switching multiple columns in one row to multiple rows (column-to-row transposition). The following source data has unfixed number of columns:

esProc_sql_dynamic_transpose_10

The targeted result:

esProc_sql_dynamic_transpose_11

esProc script for implementing the task:

esProc_sql_dynamic_transpose_12

A1: Execute SQL to retrieve data.

A2: Create an empty table sequence for the targeted result.

A3: Calculate the number of rows into which each record is to be split, based on the number of A1’s columns.

A4-B4: Loop through A1’s data sets to dynamically get data from each column and insert it into A2’s resulting table sequence.

 

Posted in SQL-related Puzzle | Tagged , , , , | Leave a comment

How esProc Assists Writing SQL Queries

esProc can simplify complex SQL-style queries using ordered sets, object-style access and stepwise computation. Usually there are two ways in which esProc is used: an independent one and one requiring integration with Java. Now let’s look at the first one through an example.

The sales table stores several years of order data, based on which you need to calculate the link relative ratio of each month’s sales amount in a specified time period. Below is a selection of source data:

esProc_sql_application_structure_1

Step 1: Connect to the database in esProc IDE

esProc_sql_application_structure_2

esProc allows connecting to multiple data sources at one time to perform hybrid computation. In this example MySQL is used to configure JDBC, as shown below:

esProc_sql_application_structure_3

Step 2: Implement the algorithm in esProc IDE

esProc_sql_application_structure_4

First retrieve data from the database according to a time period, in which begin and end are query parameters; then group data by years and months, and aggregate Amount; finally add a new field Irr (the link relative ratio by month) to A2. In the new field expression, mAmount represents sales amount of the current month and mAmount[-1] represents that of the previous month.

Step 3: Execute the script

The script execution requires assigning values to the parameters. The interface is as follows:

esProc_sql_application_structure_5

Click “OK” to complete the computation. Then you can see result of each step by clicking the corresponding cell. For instance, below is A2’s result:

esProc_sql_application_structure_6

A3 stores the final result, as shown below:

esProc_sql_application_structure_7

Note: Apart from executing all cells sequentially as the above shows, esProc provides a shortcut key for executing the current cell.

For a script that is already finished and stored, you can execute it using the command line or batch processing, without the need of running it via IDE every time.

Using the command line to execute scripts

You can use esprocx.exe to execute scripts, in a form like esprocx -R d:\Query1.dfx 2012-06-01 2014-06-30. -R option can display the computing result, the last cell by default. This is shown below:

esProc_sql_application_structure_8

See esProc Integration & Application: Command Line for details.

Besides being used alone, sometimes the result of an esProc script requires invocation by Java.

Java invocation

The Java main program can invoke the esProc script via JDBC, using the following code:

Class.forName(“com.esproc.jdbc.InternalDriver”);

con= DriverManager.getConnection(“jdbc:esproc:local://”);

// call esProc script (which is similar to the stored procedure); Query1 is the name of the dfx file

st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call Query1 (?,?)”);

java.util.Date  dateBegin  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-01-01”);

java.sql.Date  sqlDateBegin  =  new java.sql.Date(dateBegin.getTime());

java.util.Date  dateEnd  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-12-31”);

java.sql.Date  sqlDateEnd  =  new java.sql.Date(dateEnd  .getTime());

st.setDate(1, sqlDateBegin);

st.setDate(2, sqlDateEnd );

// Execute the script

st.execute();

// Get the result set

ResultSet rs = st.getResultSet();

……

esProc will return the last calculation cell by default, or it will return a certain cell using the return statement. The returned result is a ResultSet object in accordance with JDBC standards. The method of calling an esProc script is the same as that of accessing a database. Programmers who are familiar with JDBC can master it quickly.

The reporting tools that support JDBC interface can invoke an esProc script as well. For details, see How to use esProc to assist Reporting tools

The above is the normal way of integrating an esProc script by Java. Now let’s move on to look at the special cases:

Simple script without a file

A simple esProc script can be written directly in the Java application, without the need of creating a script file. Note that when SQL is used, prepareStatementment uses the quotation mark as the parameter placeholder. But as the quotation mark is specially reserved for use in esProc expressions, you need to use the form of “arg1,arg2,arg3” to hold places sequentially. Here’s the code:

ResultSet rs1 =(com. esproc.jdbc.InternalCStatement)con.prepareStatement(“$select * from sales3 where OrderDate>=? and OrderDate<=?;arg1,arg2\n =A1.groups(year(OrderDate):y,month(OrderDate):m;sum(Amount):mAmount)\n =A2.derive(mAmount/mAmount[-1]:lrr)”);

java.util.Date  dateBegin  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-01-01”);

java.sql.Date  sqlDateBegin  =  new java.sql.Date(dateBegin.getTime());

java.util.Date  dateEnd  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-12-31”);

java.sql.Date  sqlDateEnd  =  new java.sql.Date(dateEnd  .getTime());

st.setDate(1, sqlDateBegin);

st.setDate(2, sqlDateEnd );

st.execute()

Writing data to another database

It is not necessarily that a result set needs to be returned. For instance, if you want to update the Oracle database with A3, here’s the code:

esProc_sql_application_structure_9

myDB1 and orDB1 are different data sources. But with update function, you can update the summaryTable table in orDB1 with A3’s table sequence. In the table, the fields Year and Month correspond A3’s y and m respectively; and mAmount and Irr fields have counterparts of the same names in A3.

In this case, the code of integration by Java doesn’t need st.getResultSet() to get the result.

Big result set

Sometimes the computing result is too big to be held in the memory. In that case you can use an esProc cursor function to return it, and use JDBC flow to access Java. For example, tb1 is the big table in MySQL database, and tb2 is another big one in Oracle database. Both have the same structure but they store different data. Now you need to find records from them with the same IDs.

esProc code:

esProc_sql_application_structure_10

The cursor function can execute a SQL statement and return a cursor. The merge function can merge ordered data; its @x means merging cursors and @i means returning an intersection of the cursors.

A2 also returns a cursor, from which the Java main program cannot fetch data at one time. So you need to perform a batch fetching with the following code:

st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call bigMerge ()”);

st.setFetchSize(1000);//Set the number of records retrieved in each batch

st.execute();

ResultSet rs = st.getResultSet();

while (rs.next()) {

……

}

Posted in SQL-related Puzzle | Tagged , , | Leave a comment

How esProc Assists Java to Retrieve Text Files

Java provides functions for handling the basic file processing, which refers to the retrieval of small text files, in a simple, unstructured way. But in handling files requiring structured format, holding data of various formats and having particular requirements, or big files that cannot be entirely loaded into memory, Java code is too complicated and its readability and reusability are hard to be guaranteed.

esProc (free edition is available) can be used to make up for these deficiencies. esProc encapsulates a lot of functions for reading in/writing out and processing structured data, and provides the JDBC interface. A Java application will identify esProc script as a database stored procedure to execute, pass parameters to it and get result set via JDBC. You can learn more from How to Use esProc as the Class Library for Java.

The following cases are those you frequently encounter in retrieving text files in Java, and their esProc solutions.

Retrieving specified fields

You need to import the OrderID, Client and Amount column by their names. Below is the source data:

esProc_java_retrieve_text_2

esProc code:

esProc_java_retrieve_text_3

The result:

esProc_java_retrieve_text_4

1. @t means importing the first row as column names. If there are no column names, you can use their sequence numbers to reference columns. To import the first, the second and the fourth column, for example, use file(“D: \\sOrder.txt”).import(#1,#2,#4). The result is as follows:

esProc_java_retrieve_text_5

2. You can also export a computed column. For example, use the following code to combine the year and OrderID into a newOrderID and export it along with Client and Amount:

esProc_java_retrieve_text_6

By default import function reads in all fields. new function creates a two-dimensional table. The result is:

esProc_java_retrieve_text_7

  1. The default separator is tab, or you can use other separators. To import a CSV file separated with commas, for example, use file(“D: \\sOrder.txt”).import@t(;”,”).
  2. To export some of the rows, specify them by row numbers. For example, use A1.to(2,100) to export rows from the second to the hundredth; and use A1.to(3,) to export rows beginning from the third one to the end.
  3. In a few cases, columns of data need to be retrieved and exported as one column. For example, to concatenate OrderID, Client and Amount and export them as one column, you can use the following code after data importing:

create(all).record(A1.(OrderID)|A1.(Client)|A1.(Amount))

Retrieving big files

To retrieve a big file that exceeds the memory capacity, use an esProc cursor, which can be accessed by a Java application with JDBC stream.

esPro code:

esProc_java_retrieve_text_8

  1. To accelerate the file retrieval, you can use multithreaded parallel processing through @m option. The code is =file(“D: \\sOrder.txt”).cursor@tm(OrderID,Client,Amount). But this approach cannot guarantee that data is retrieved in its original order.
  2. Sometimes you need to segment data manually before processing it in parallel. To read in a segment of data, use file(“D:\\sOrder.txt”).import@z@t(;,2:24). @z means dividing the file roughly into 24 segments by bytes and importing the second one only. esProc will automatically skip the head row and make up the tail row to ensure that each row is retrieved completely. If the data size in each segment still exceeds the memory capacity, you can replace the import function with cursor function to export data as a cursor.

Retrieving file by column lengths

The following data.txt file does not use the separator:

esProc_java_retrieve_text_9

You need to retrieve the file into a four-column two-dimensional table according to specified column lengths and return it to Java. The id column will have the first three bits, flag column will have the 10th and 11th bits, d1 column will have bits from the 14th to the 24th, and d2 column will have bits from the 25th to 33rd. Thus the four columns in the first row will be 001, DT, 100000000000 and 3210XXXX.

esProc code:

esProc_java_retrieve_text_10

@i means returning a sequence (set) if the file has only one column. Then create a two-dimensional table based on A1; mid function truncates a string and ~ represents each row.

Here’s the result:

esProc_java_retrieve_text_11

Retrieving file containing special characters

The data.csv file contains quotation marks, some of which disrupt the use of the data. So you need to remove the quotation marks before returning the file to Java: Below is the source data:

esProc_java_retrieve_text_12

esProc code:

esProc_java_retrieve_text_13

Here’s the result:

esProc_java_retrieve_text_14

Retrieving the file containing mathematical formulas

In this case, you need to parse the mathematical formulas into expressions, evaluate them and return the results. Below is the source data:

esProc_java_retrieve_text_15

esProc code:

esProc_java_retrieve_text_16

eval function dynamically parses strings into expressions to execute.

The result is:

esProc_java_retrieve_text_17

Retrieving file with multi-line records

In the following file, each record includes three lines. For example, the first record is JFS    3       468.0        2009-08-13 39. Now you need to export the file into a two-dimensional table.

esProc_java_retrieve_text_18

esProc code:

esProc_java_retrieve_text_19

First import the file as a sequence; @s means not splitting the field. Then group the sequence every three members;“#” represents the row number and “\” means integer division. Here’s the result:

esProc_java_retrieve_text_21

If the file is too big to be entirely loaded into memory, you need to use the cursor to retrieve it and perform batch processing. First create a sub.dfx, which responses the external request of data retrieval by retrieving a batch of data and return it. This operation repeats until the whole file is retrieved. Below is the esProc code:

esProc_java_retrieve_text_22

Loop through A1 and retrieve 3,000 rows in each loop. After that you can handle the algorithm the above does. B4 returns B3 to the main script. The main script (which is the dfx file to be called by Java ) is as follows:

esProc_java_retrieve_text_23

pcursor function requests retrieving data through sub.dfx and converts data to a cursor and exports it.

Retrieving records from uncertain lines

With the data.txt file, field values in a record scatter in uncertain number of lines. But fields are fixed. They are “Object Type”, “left”, “top” and “Line Color” and appear repeatedly until the end of the file. The first record, for example, is Symbol1, 14, 11 and RGB( 1 0 0 ). Now you need to retrieve the file into a structured two-dimensional table.

esProc_java_retrieve_text_24

esProc code:

esProc_java_retrieve_text_25

The read function can read in the file as a big string, and then split the string with the separator and remove the first empty line. Finally create a table sequence and use string functions – array, pos, len, mid – to find the desired fields. Note that you should use an if statement to judge the last line, for maybe no carriage return is used there. Here’s the final result:

esProc_java_retrieve_text_27

Besides the string functions, you can use regular expressions to find the desired fields.

To handle a big file that cannot be loaded into memory in one go, use pcursor function to retrieve it in batches.

Retrieving records by marked groups

The data.txt file stores records by groups. Group names are marked by list (such as ARO, BDR, and BSF). You need to combine group names with their field values to form records and export them. Below is the source data:

esProc_java_retrieve_text_28

esProc code:

2015-09-14_185019

First import the file into a sequence of strings, and then group the sequence according to lines headed by list. @i will group data into a same group if it satisfies the specified condition. The sign * represents the wildcard character. Here’s A2’s result:

esProc_java_retrieve_text_31

And then retrieve the desired fields and concatenate the records from each group. Here’s the result:

esProc_java_retrieve_text_32

Posted in Java Assistant | Leave a comment

How esProc Assists Java to Query Big Text Files

Sometimes you need to query a big text file, instead of the database. In those cases, you need to retrieve the file in a stream style to perform the query algorithm, in which parallel processing is needed to improve the computing performance. As Java lacks the class library for doing these, you have to use hardcoding to handle the processing of structured data, and, as a result, the code is complicated, unreadable, and inefficient in performing parallel processing.

You can use esProc (free edition is now available) to make up for what Java lacks. Encapsulated with rich functions for reading and writing structured data and cursor functions, esProc handles parallel processing with simple code. It also provides the easy-to-use JDBC interface. The Java application can identify an esProc script as the database to execute, pass parameters to it and get the result set via JDBC. You can learn more details from How to Use esProc as the Class Library for Java.

Here’s an example for explaining the process of how esProc helps Java in querying the big text file. Below is the source data:

esProc_java_query_bigtextfile_2

To query orders whose dates are between startDate and endDate and whose amounts are greater than argAmount, use the following code:

esProc_java_query_bigtextfile_3

Open the file as a cursor with the cursor function; @t means importing the first row as column names. Then perform structured query, and fetch data from the query result into the memory if the result is not big. The result is as follows:

esProc_java_query_bigtextfile_4

If the memory cannot hold the query result, you can return a cursor directly from the esProc script (i.e. delete A3’s code). And then the Java application can fetch data from the returned cursor in JDBC stream style.

esProc also supports multithreaded parallel processing. The simplest way is using @m option with cursor function in the preceding code. The option means retrieving file with multiple threads.

Or you can segment the file manually to use multiple threads in both data retrieval and data computing. The code is as follows:

esProc_java_query_bigtextfile_5

It opens the file with 8 cursors, each retrieving a specified part of the file. @z means dividing the file roughly into multiple segments by bytes and retrieving one of them each time. esProc will automatically skip the head row and make up the tail row to ensure that each row is retrieved completely.

The conj function can merge the results. @x means the objects of merge operation are cursors; @m means performing parallel computing. Note that the function cannot guarantee the order consistency of the records in the result set and the original file.

The preceding code uses the esProc built-in function for parallel processing. If the algorithm is complicated and there is enough memory space for holding the result set, it is better to use explicit parallel statement. The code is as follows:

esProc_java_query_bigtextfile_6

It uses 8 parallel threads to retrieve and process the big file, and each return the result to the main program after the query is done. The fork function works to execute these threads. Its working range is B2-B3, within which you can use A2 to get the entry parameter and outside which you can use A2 to get the results of all the threads.

For ordered data, you can use binary search to increase the query performance. For example, data has been sorted by Client and OrderID, you need to find corresponding records according to parameters argClient and argOrder. To do this, use the following code:

esProc_java_query_bigtextfile_7

begin and end are beginning position and ending position specified for the binary search. m is the middle position.

Locate the middle position by bytes and retrieve a record with cursor. esProc will automatically skip the head row and make up the tail row to retrieve a complete row. If the locating is successful, store the current record in C5. If it fails, continue to compare the related sets and reset begin and end.

Posted in Java Assistant | Tagged , , | Leave a comment

How esProc Assists Java in Handling JSON Data

Java can handle simple parsing of semi-structured JSON data, but has difficulty in handling the in-depth processing. esProc, which supports set operations, order-related operations and dynamic script execution, can help reduce the difficulty. A Java application will execute an esProc script in the same way as it executes a database stored procedure, pass parameters to the script and get the returned esProc result through JDBC. For more details, see How to Use esProc as the Class Library for Java.

Below are examples of the most common problems you may encounter in handling JSON data in Java, and their esProc solutions.

Grouping and aggregating JSON data

order.json stores order records. You need to calculate the total sales amount a client contributes per month in a specified time period. Below is a selection of the source data:

esProc_java_json_1

esProc script:

esProc_java_json_2

Import the JSON file as a two-dimensional table; perform a conditional query and then the group and aggregate. Both argBegin and argEnd are Java parameters. Here’s the result:

esProc_java_json_3

Parsing JSON files with different fields in each document

The documents of Data.json have different fields whose orders are different too, as shown below:

esProc_java_json_4

You need to export the file in Java as a CSV file, whose desired format is as follows:

esProc_java_json_5

esProc script:

esProc_java_json_6

Import the JSON file into memory and retrieve the People field, and write it into a file separated with commas. Here’s result of A2:

esProc_java_json_7

Exporting a multi-level JSON file as a CSV file

Below is the original JSON file:

esProc_java_json_8

You want to arrange the data in two columns in the CSV file, as shown below:

esProc_java_json_9

The difficulty is to join each of the root fields with its subfields whose names and field values need to be placed into two columns. Below is the esProc script:

esProc_java_json_10

Import the JSON file and divide data into 3 groups according to the root fields, as shown below (A2):

esProc_java_json_11

Then insert the record [root field name,null] before the first record of each group and concatenate the groups and export the data as a CSV file, as shown below:

esProc_java_json_12

Dynamically storing a JSON file in the database

s.json contains multi-level subdocuments, among which LIST and SERVICES have different numbers of fields. You need to export s.json to the database tables groups and service in Java. The requirement is that GROUPNAME values and field names of SERVICES will constitute the database table groups and the subfields of SERVICES will form another database table service; the two tables are related through groupsid field.

Below is the source data:

esProc_java_json_13

esProc script:

esProc_java_json_14

Import the JSON file in its original structure, loop through every subdocument of SUCCESS (B4-D10), during which every field of List(C5-D10) and Service (D7-D10) will be accessed, and append records to the empty two-dimensional tables in A2 and A3 and finally perform database updates. The script uses several functions to access a two-dimensional table. The fno function gets the number of fields, the fname function gets field names and the field function gets field values by sequence numbers.

Below are the results of A2 and A3 after data appending:

esProc_java_json_15

Posted in Java Assistant | Leave a comment

How to Use esProc as the Class Library for Java

Encapsulated lots of functions for handling structured file computing, esProc can import text files with complex formats, implement cursor-style processing with big files and simplify multithreaded parallel processing. A Java application can pass in parameters, execute an esProc script the same way as it executes a database stored procedure and gets the result set through JDBC.

Below is the structure of integrating an esProc script and a Java application:

esProc_java_application_library_1

An example will tell you how a Java application integrates an esProc script.

The sOrder.txt file, separated by tabs, holds a set of order information. You need to do a conditional query using Java based on this file and return orders in a specified time period.

Below is a selection of sOrder.txt:

esProc_java_application_library_2

Step 1: Implement the algorithm in esProc IDE

 

esProc_java_application_library_3

A1: Import the file, using the default separator tab. @t means importing the first row as the column names.

A2: Execute the conditional query. startDate and endDate are parameters transferred from Java. Their values are dates, like 2010-01-01and 2010-12-31.

Step 2: Check computing result in esProc IDE

You can check the computing result by clicking A2.

 

esProc_java_application_library_4

Step 3: Integrate esProc script into Java

The Java main program can call the esProc script via JDBC, using the following code:

        Class.forName(“com.esproc.jdbc.InternalDriver”);

                    con= DriverManager.getConnection(“jdbc:esproc:local://”);

                    // call esProc script (which is similar to the stored procedure); orderQuery is the name of the dfx file

                    st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call orderQuery (?,?)”);

                    st.setObject(1,”2010-01-01″);

                    st.setObject(2,”2010-12-31″);

                    // Execute the script

                    st.execute();

                    // Get the result set

                    ResultSet rs = st.getResultSet();

                            ……

esProc will return the last calculation cell by default, or it will return a certain cell using the return statement. The returned result is a ResultSet object in accordance with JDBC standards. The method of calling an esProc script is the same as that of accessing a database. Programmers who are familiar with JDBC can master it quickly.

The above example illustrates the normal way of integrating an esProc script by Java. Now let’s move on to look at the special cases:

Simple script without a file

A simple esProc script can be written directly in the Java application, without creating a script file. Thus the code in the preceding example can be like this:

  st = (com. esproc.jdbc.InternalCStatement)con.createStatement();

                   ResultSet rs1 = st.executeQuery(“=file(\”D:\\sOrder.txt\”).import@t()\n” + “=A1.select(OrderDate>=date(\”2010-01-01\”) && OrderDate<=date(\”2010-12-31\”))”);

You just need to use the carriage return “\n” to separate lines (and “\t” to separate columns).

You can also use the prepareStatementment object to execute the script, forcing a parameter type conversion. The SQL prepareStatementment object uses the quotation mark as the parameter placeholder. But as the quotation mark is a part of an esProc expression, you need to use the form of “arg1,arg2,arg3” to hold places sequentially. Here’s the code:

      st= (com. esproc.jdbc.InternalCStatement)con.prepareStatement(“=file(\”D:\\sOrder.txt\”).import@t()\n” + “=A1.select(OrderDate>=arg1 && OrderDate<=arg2)”);

                   java.util.Date  dateBegin  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-01-01″);    

                   java.sql.Date  sqlDateBegin  =  new java.sql.Date(dateBegin.getTime());

                   java.util.Date  dateEnd  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-12-31″);    

                   java.sql.Date  sqlDateEnd  =  new java.sql.Date(dateEnd  .getTime());

                   st.setDate(1, sqlDateBegin); 

                   st.setDate(2, sqlDateEnd ); 

                   ResultSet rs1 = st.executeQuery(); 

Big result set

Sometimes the computing result is too big to be held in the memory. In that case you can use an esProc cursor function to return it, and use JDBC flow to access Java. For example, to query the big file sOrderBig.txt by a certain time period, you could use the following esProc code:

 

esProc_java_application_library_5

cursor function opens the file with cursor and select function returns the query result as a cursor.

The following is the code of integrating esProc script by Java:

st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call orderBigQuery (?,?)”);

           st.setObject(1,”2010-01-01″);

           st.setObject(2,”2010-12-31″);

           st.setFetchSize(1000);// Set the number of records retrieved in each batch

           st.execute();

           ResultSet rs = st.getResultSet();

           while (rs.next()) {

                                     ……

                            }

Database computing

You can handle the database computing involved in the computation in esProc and return the result all at once through esProc JDBC, without having to integrating the database in Java. For example, you can use the following code to align the database table emp with sOrder.txt:

 

esProc_java_application_library_6

See details about accessing database in esProc from How esProc Assists Writing SQL Queries.

And learn more about the integration of esProc script by Java in esProc Integration & Application: Java Invocation.

Posted in Java Assistant | Leave a comment