How esProc Assists Java to Query Big Text Files

Sometimes you need to query a big text file, instead of the database. In those cases, you need to retrieve the file in a stream style to perform the query algorithm, in which parallel processing is needed to improve the computing performance. As Java lacks the class library for doing these, you have to use hardcoding to handle the processing of structured data, and, as a result, the code is complicated, unreadable, and inefficient in performing parallel processing.

You can use esProc (free edition is now available) to make up for what Java lacks. Encapsulated with rich functions for reading and writing structured data and cursor functions, esProc handles parallel processing with simple code. It also provides the easy-to-use JDBC interface. The Java application can identify an esProc script as the database to execute, pass parameters to it and get the result set via JDBC. You can learn more details from How to Use esProc as the Class Library for Java.

Here’s an example for explaining the process of how esProc helps Java in querying the big text file. Below is the source data:

esProc_java_query_bigtextfile_2

To query orders whose dates are between startDate and endDate and whose amounts are greater than argAmount, use the following code:

esProc_java_query_bigtextfile_3

Open the file as a cursor with the cursor function; @t means importing the first row as column names. Then perform structured query, and fetch data from the query result into the memory if the result is not big. The result is as follows:

esProc_java_query_bigtextfile_4

If the memory cannot hold the query result, you can return a cursor directly from the esProc script (i.e. delete A3’s code). And then the Java application can fetch data from the returned cursor in JDBC stream style.

esProc also supports multithreaded parallel processing. The simplest way is using @m option with cursor function in the preceding code. The option means retrieving file with multiple threads.

Or you can segment the file manually to use multiple threads in both data retrieval and data computing. The code is as follows:

esProc_java_query_bigtextfile_5

It opens the file with 8 cursors, each retrieving a specified part of the file. @z means dividing the file roughly into multiple segments by bytes and retrieving one of them each time. esProc will automatically skip the head row and make up the tail row to ensure that each row is retrieved completely.

The conj function can merge the results. @x means the objects of merge operation are cursors; @m means performing parallel computing. Note that the function cannot guarantee the order consistency of the records in the result set and the original file.

The preceding code uses the esProc built-in function for parallel processing. If the algorithm is complicated and there is enough memory space for holding the result set, it is better to use explicit parallel statement. The code is as follows:

esProc_java_query_bigtextfile_6

It uses 8 parallel threads to retrieve and process the big file, and each return the result to the main program after the query is done. The fork function works to execute these threads. Its working range is B2-B3, within which you can use A2 to get the entry parameter and outside which you can use A2 to get the results of all the threads.

For ordered data, you can use binary search to increase the query performance. For example, data has been sorted by Client and OrderID, you need to find corresponding records according to parameters argClient and argOrder. To do this, use the following code:

esProc_java_query_bigtextfile_7

begin and end are beginning position and ending position specified for the binary search. m is the middle position.

Locate the middle position by bytes and retrieve a record with cursor. esProc will automatically skip the head row and make up the tail row to retrieve a complete row. If the locating is successful, store the current record in C5. If it fails, continue to compare the related sets and reset begin and end.

Posted in Java Assistant | Tagged , , | Leave a comment

How esProc Assists Java in Handling JSON Data

Java can handle simple parsing of semi-structured JSON data, but has difficulty in handling the in-depth processing. esProc, which supports set operations, order-related operations and dynamic script execution, can help reduce the difficulty. A Java application will execute an esProc script in the same way as it executes a database stored procedure, pass parameters to the script and get the returned esProc result through JDBC. For more details, see How to Use esProc as the Class Library for Java.

Below are examples of the most common problems you may encounter in handling JSON data in Java, and their esProc solutions.

Grouping and aggregating JSON data

order.json stores order records. You need to calculate the total sales amount a client contributes per month in a specified time period. Below is a selection of the source data:

esProc_java_json_1

esProc script:

esProc_java_json_2

Import the JSON file as a two-dimensional table; perform a conditional query and then the group and aggregate. Both argBegin and argEnd are Java parameters. Here’s the result:

esProc_java_json_3

Parsing JSON files with different fields in each document

The documents of Data.json have different fields whose orders are different too, as shown below:

esProc_java_json_4

You need to export the file in Java as a CSV file, whose desired format is as follows:

esProc_java_json_5

esProc script:

esProc_java_json_6

Import the JSON file into memory and retrieve the People field, and write it into a file separated with commas. Here’s result of A2:

esProc_java_json_7

Exporting a multi-level JSON file as a CSV file

Below is the original JSON file:

esProc_java_json_8

You want to arrange the data in two columns in the CSV file, as shown below:

esProc_java_json_9

The difficulty is to join each of the root fields with its subfields whose names and field values need to be placed into two columns. Below is the esProc script:

esProc_java_json_10

Import the JSON file and divide data into 3 groups according to the root fields, as shown below (A2):

esProc_java_json_11

Then insert the record [root field name,null] before the first record of each group and concatenate the groups and export the data as a CSV file, as shown below:

esProc_java_json_12

Dynamically storing a JSON file in the database

s.json contains multi-level subdocuments, among which LIST and SERVICES have different numbers of fields. You need to export s.json to the database tables groups and service in Java. The requirement is that GROUPNAME values and field names of SERVICES will constitute the database table groups and the subfields of SERVICES will form another database table service; the two tables are related through groupsid field.

Below is the source data:

esProc_java_json_13

esProc script:

esProc_java_json_14

Import the JSON file in its original structure, loop through every subdocument of SUCCESS (B4-D10), during which every field of List(C5-D10) and Service (D7-D10) will be accessed, and append records to the empty two-dimensional tables in A2 and A3 and finally perform database updates. The script uses several functions to access a two-dimensional table. The fno function gets the number of fields, the fname function gets field names and the field function gets field values by sequence numbers.

Below are the results of A2 and A3 after data appending:

esProc_java_json_15

Posted in Java Assistant | Leave a comment

How to Use esProc as the Class Library for Java

Encapsulated lots of functions for handling structured file computing, esProc can import text files with complex formats, implement cursor-style processing with big files and simplify multithreaded parallel processing. A Java application can pass in parameters, execute an esProc script the same way as it executes a database stored procedure and gets the result set through JDBC.

Below is the structure of integrating an esProc script and a Java application:

esProc_java_application_library_1

An example will tell you how a Java application integrates an esProc script.

The sOrder.txt file, separated by tabs, holds a set of order information. You need to do a conditional query using Java based on this file and return orders in a specified time period.

Below is a selection of sOrder.txt:

esProc_java_application_library_2

Step 1: Implement the algorithm in esProc IDE

 

esProc_java_application_library_3

A1: Import the file, using the default separator tab. @t means importing the first row as the column names.

A2: Execute the conditional query. startDate and endDate are parameters transferred from Java. Their values are dates, like 2010-01-01and 2010-12-31.

Step 2: Check computing result in esProc IDE

You can check the computing result by clicking A2.

 

esProc_java_application_library_4

Step 3: Integrate esProc script into Java

The Java main program can call the esProc script via JDBC, using the following code:

        Class.forName(“com.esproc.jdbc.InternalDriver”);

                    con= DriverManager.getConnection(“jdbc:esproc:local://”);

                    // call esProc script (which is similar to the stored procedure); orderQuery is the name of the dfx file

                    st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call orderQuery (?,?)”);

                    st.setObject(1,”2010-01-01″);

                    st.setObject(2,”2010-12-31″);

                    // Execute the script

                    st.execute();

                    // Get the result set

                    ResultSet rs = st.getResultSet();

                            ……

esProc will return the last calculation cell by default, or it will return a certain cell using the return statement. The returned result is a ResultSet object in accordance with JDBC standards. The method of calling an esProc script is the same as that of accessing a database. Programmers who are familiar with JDBC can master it quickly.

The above example illustrates the normal way of integrating an esProc script by Java. Now let’s move on to look at the special cases:

Simple script without a file

A simple esProc script can be written directly in the Java application, without creating a script file. Thus the code in the preceding example can be like this:

  st = (com. esproc.jdbc.InternalCStatement)con.createStatement();

                   ResultSet rs1 = st.executeQuery(“=file(\”D:\\sOrder.txt\”).import@t()\n” + “=A1.select(OrderDate>=date(\”2010-01-01\”) && OrderDate<=date(\”2010-12-31\”))”);

You just need to use the carriage return “\n” to separate lines (and “\t” to separate columns).

You can also use the prepareStatementment object to execute the script, forcing a parameter type conversion. The SQL prepareStatementment object uses the quotation mark as the parameter placeholder. But as the quotation mark is a part of an esProc expression, you need to use the form of “arg1,arg2,arg3” to hold places sequentially. Here’s the code:

      st= (com. esproc.jdbc.InternalCStatement)con.prepareStatement(“=file(\”D:\\sOrder.txt\”).import@t()\n” + “=A1.select(OrderDate>=arg1 && OrderDate<=arg2)”);

                   java.util.Date  dateBegin  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-01-01″);    

                   java.sql.Date  sqlDateBegin  =  new java.sql.Date(dateBegin.getTime());

                   java.util.Date  dateEnd  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-12-31″);    

                   java.sql.Date  sqlDateEnd  =  new java.sql.Date(dateEnd  .getTime());

                   st.setDate(1, sqlDateBegin); 

                   st.setDate(2, sqlDateEnd ); 

                   ResultSet rs1 = st.executeQuery(); 

Big result set

Sometimes the computing result is too big to be held in the memory. In that case you can use an esProc cursor function to return it, and use JDBC flow to access Java. For example, to query the big file sOrderBig.txt by a certain time period, you could use the following esProc code:

 

esProc_java_application_library_5

cursor function opens the file with cursor and select function returns the query result as a cursor.

The following is the code of integrating esProc script by Java:

st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call orderBigQuery (?,?)”);

           st.setObject(1,”2010-01-01″);

           st.setObject(2,”2010-12-31″);

           st.setFetchSize(1000);// Set the number of records retrieved in each batch

           st.execute();

           ResultSet rs = st.getResultSet();

           while (rs.next()) {

                                     ……

                            }

Database computing

You can handle the database computing involved in the computation in esProc and return the result all at once through esProc JDBC, without having to integrating the database in Java. For example, you can use the following code to align the database table emp with sOrder.txt:

 

esProc_java_application_library_6

See details about accessing database in esProc from How esProc Assists Writing SQL Queries.

And learn more about the integration of esProc script by Java in esProc Integration & Application: Java Invocation.

Posted in Java Assistant | Leave a comment

How to Use esProc as the Class Library of Structured File Computing for Java

In some cases, data must be stored in the file system, rather than in a database. That requires handling file-based data computing manually. Since Java lacks the related class library, you have to hardcode the structured file computing, which produces complicated and unreadable code.

Yet, you can make up for this deficiency with esProc (free edition is available). esProc encapsulates plenty of functions for processing structured files and provides the JDBC interface. A Java application will treat an esProc script as a database stored procedure, execute it after passing parameter and get the result set via JDBC. You can learn more details from How to Use esProc as the Class Library for Java.

Now let’s look at some familiar scenarios involving structured computing in Java and their esProc solutions.

Querying text file

sOrder.txt is a text file separated by tabs. You need to query orders in a specified time period in Java.

Below is the source data:

esProc_java_class_library_2

esProc code:

esProc_java_class_library_3

Import the file, using tab as the default separator. @t means importing the first row as column names. Then perform the conditional filtering. startDate and endDate are input parameters, like a period from 2010-01-01 to 2010-12-31. Here’s the result:

esProc_java_class_library_4

Sorting text file

Based on the above sOrder.txt, you need to sort records in Java by client numbers in descending order and by year and month in ascending order.

esProc code: =A1.sort(-Client,year(OrderDate),month(OrderDate))

Explanation: Use “-” to sort data in descending order. You need to do some calculation to get the years and the months.

Related information: To perform sort after the querying, you can use =A2.sort(…), or =A1.select(…).sort(…)

Here’s the result:

esProc_java_class_library_5

Group and aggregate operations

You need to calculate the sales amount and number of orders per seller per year.

esProc code: =A1.groups(SellerId,year(OrderDate);sum(Amount),count(~))

Explanation: group function can perform aggregate while grouping data. ~ represents each group or the current group. count(~) is equivalent to count(OrderID).

Here’s the result:

esProc_java_class_library_6

Getting distinct value

You need to make a client list according to sOrder.

esProc code: =A1.id(Client)

Here’s the result:

esProc_java_class_library_7

Removing duplicate values

You need to get the first record for each client and for each seller

esPrco code: =A1.group@1(Client,SellerId)

Explanation: group function is used to group records (without having to aggregate them). @1 means getting the first record from each group.

Here’s the result:

esProc_java_class_library_8

Finding topN

You need to get the 3 orders with the greatest sales amount for each seller.

esProc code: =A1.group(SellerId;~.top(3,-Amount):t).conj(t)

Explanation: top function filters records to get TopN. The sign “-” means a reversed order. conj function is used for concatenation.

Related information: To get the order with the greatest sales amount, use maxp function.

Here’s the result:

esProc_java_class_library_9

Related computing

emp.txt is a text file, separated by tabs. Its EId field corresponds to sOrder’s SellerId field. Now with Java, you need to align Name, Dept and Gender fields in emp.txt with sOrder.txt.

Below is the source data:

esProc_java_class_library_11

esProc code:

esProc_java_class_library_10

The join function performs a join and changes names of the two tables into s and e respectively; @1 means a left join. Then you retrieve the desired fields from the joined table to create a new structured two-dimensional table. Here’s the result:

esProc_java_class_library_12

Related information: @1 means a left join; @f means a full join. No option is needed for an inner join.

All the above examples assume that the size of the file is relatively small. If the file is too big to be entirely loaded into memory, you can use the esProc cursor to handle it. See related documents for detailed information.

Posted in Java Assistant | Tagged , , | Leave a comment

Structured Data Computing: the Focus of Routine Data Analysis

Essentially, the main content of various data analyses in our routine business is the structured data computing. For example:

  • Compute the link relative ratio and year-on-year comparison of each business branch’s monthly sales during a specified period of time.

Implementation approach: filter the sales data by time range, then group and summarize data by business branch, year and month, and at last, perform cross-row and –group ordered data computing.

  • Select stocks whose closing price has been increasing uninterruptedly for over 10 days.

Implementation approach: Group daily transaction data by stocks and sort the data of each group by dates, compute the increasing amount of the share price and the number of days during which the share price increases uninterruptedly, and filter away the stocks that have been rising uninterruptedly for over 10 days.

  • Relate the data of different sources, like contract and payment information, to project payments schedule and find out the overdue projects.

Implementation approach: Perform relational computing between heterogeneous data sources, then group, summarize and filter the data.

It can be seen that these routine problems of data analysis can be split into structured data operations including filtering, grouping, summarizing, sorting, ranking and relational computing.

Of course, we may need to solve data analysis problems of modeling or prediction occasionally. For example, find out goods that are closely related between each other, or predict which stock is supposed to rise, and the like. These operations require quite a lot of mathematical knowledge which ordinary staff is generally not likely to have. They are really important data analysis transaction, but they occupy only a very small part of routine data analysis.

Structured data computing is the focus. There are many tools that can perform it, like R language, Python, SQL and esProc.

R language provides dataframe data type for structured data computing. However, it was originally designed for collecting and analyzing scientific data, especially for performing matrix and vector computations. It is not professional for structured data computing.

In fact, dataframe is a newly-developed function of R language; its strong point is algorithms of modeling and prediction, such as regression analysis, ANOVA analysis, Agreementevaluation, and Bernoulli distribution, etc, which are seldom used in routine data analysis.

Pandas, Python’s third party function library, can perform structured data computing. But it was also designed for collecting and analyzing scientific data instead of structured data computing, so it is not professional too. And similar to R language, the functions of Pandas center on modeling and prediction and are seldom used in routine data analysis.

We can see that, despite lots of tools for performing structured data computing, few can be regarded as truly professional. There are only one professional, SQL, the old brand computer language.

SQL was designed purely for structured data computing. It is professional and widely used.

Yet it also has drawbacks for routine data analysis. The most obvious ones are complicated application environment and being bad at ordered data computing. The installation, configuration, maintenance and management of SQL are very complicated. SQL data set hasn’t inherent serial numbers and gets disadvantaged in ordered data computing, for example, the common problems in routine data analysis like link relative ratio, year-on-year comparison, fetching data in a relative interval, performing ranking during data grouping and getting records in the top and bottom, etc. Most of the examples we mentioned at the beginning involve ordered data computing. And though we can solve them with SQL, the operation will be quite difficult.

Similar to SQL, esProcis specially designed for structured data computing.

By comparison, esProc’s application environment, installation and configuration are simple. esProc can fetch data from databases, and import structured data directly from Txt, logs and Excel. Moreover, esProc table sequence has inherent serial numbers, enabling it to perform ordered data computing easily. Unfortunately, in esProc, the syntax for external memory computing is different from that for in-memory computing, which requires different code. In this respect, SQL has better consistency in its syntax.

Posted in Unique | Tagged , , , | Leave a comment

Comparison Between esProc’s Sequence Table Object and R’s Data Frame

Both esProc and R language are typical data processing and analysis languages with two-dimensional structured data objects. They are all good at multi-step complex computations. However their two-dimensional structured data objects are quite different from each other in the underlying mechanism. As a result, esProc is better at computation with structured data, and especially suitable for developers to do business computing. R is better at matrix computation and more suitable for scientists to do scientific or engineering computation.

esProc’s two-dimensional structured data type is sequence table object (TSeq). Sequence table is based on records, with multiple records forming a row-styled two-dimensional table. In combination with the column name, this two-dimensional table can form a complete data structure. R language is based on vector, with multiple vectors forming a column-styled two-dimensional table. In combination with the column name, the two-dimensional table can form a complete data structure.

These underlying mechanisms affect actual user experience. In the following part we will compare the difference in practical use between sequence table object and data frame, in terms of basic functions, advanced features, actual use cases and test results.

Note: Primitive functions of development language are to be used in the following comparisons, the third party extension packages won’t be involved.

Basic functions

Example 1:retrieve two-dimensional structured data from the file, and access the value of the second column in the first row by coordinates.

Data frame:

data<-read.table(“e:/sales.txt”,header=TRUE,sep=”\t”)

result<-data[1,2]

Sequence table:

=data=file(“e:/sales.txt”).import@t()

=data(1).#2

Comparison: there is no significant difference in the most basic functions.

Note: the sales.txt file is tab separated structured data, and the first few lines are as following:

esProc_r_seq_dataframe_1

Example 2: access the value of the second column in the first row, by row number and by field name.

Data frame:

Result1<-data$Client[1]

Result2<-data[1,]$Client

Sequence table:

=data(1).(Client)

=data.(Client)(1)

Comparison: there is no significant difference between the two.

Example 3: Access column data. There are two scenarios, and each falls into two situations: access by column number and column names:retrieve only the second column, or retrieve a combination of the second column and the fourth column.

Data frame:

Result1<-data[2]

Result2<-data[,c(2,4)]

Result3<-data$Client

Result4<-data[,c(“Client”,”Amount”)]

Sequence table:

=data.(#2)

=data.new(#2,#4)

=data.(Client)

=data.new(Client,Amount)

Comparison: Both can access the column data. The only difference is in the syntax for retrieving multiple column data. Data frame is retrieving the number directly, while with sequence table a new sequence table will be build with the new function. Although the syntax is different, the actual methods used are the same: both are duplicating two columns of data from the original objects to new objects.

Example 4: record manipulation. Includes: retrieve the first two records, appending records, inserting record in the second row, deleting the record in the second row.

Data frame:

Record1<-data[c(1,2),]

append<- data.frame(OrderID=152,  Client=”CA”,       SellerId=5,        Amount=2961.40,   OrderDate=”2010-12-5 0:00:00″)

data<- rbind(data, append)

insert<-data.frame(OrderID=153,  Client=”RA”,  SellerId=4,     Amount=1931.20,   OrderDate=”2009-11-5 0:00:00″)

data<-rbind(data[1,], insert,data[2:151,])

data<-data[-2,]

Sequence table:

=data([1,2])

=data.insert(0,152:OrderID,”CA”:Client,5:SellerId,2961.40:Amount,”2010-12-5 0:00:00″:OrderDate)

=data.insert(2,153:OrderID,”RA”:Client,4:SellerId,1931.20:Amount,”2009-11-5 0:00:00″:OrderDate)

=data.delete(2)

Comparison: record manipulation is possible in both ways. esProc is relatively more convenient. It can use insert function to append or insert records directly to sequence table, while in R language we need to split the data frame and then merge them again to achieve the same result in an indirect way.

Summary: As both sequence table and data frame are structured, two-dimensional data object, no significant difference exists in basic functions for data reading/writing,data access and maintenance.

Advanced features

Example 5: modifying the association. A1, A2 are two-dimensional structured data object with the same field ID. We now need to add the bonus field values of A2 to the salary field values ​​in A1 according to ID.

Sequence table:

A1=db.query(“select id,name,salary from salary order by id”)

A2=db.query(“select id,bonus from bonus order by id”)

A1.modify(1:A2,salary+bonus:salary)

Data frame has no functions to modify the association. We need to do manual coding for this, which is omitted here.

Example 6: merging associations. A1, A2, A3 are two-dimensional structured data objects with the same field sequence number. Please associate them with left join. As the data is sorted by sequence number, please leverage merging methods to improve the speed for association.

Sequence Table:join@m1(A1:salary,id;     A2:bonus,id;   A3,attendance,id)

Data frame supports association of two tables, such as: merge(A1,A2,by.x=”id”,by.y=”id”,all=TRUE).In this case three tables are associated, which can be achieved indirectly through two two-tableassociations.

In addition, the data frame does not support merging of association, and therefore no speed improvement is possible. In other words, data frame cannot use ordered sequence data to improve performance, not only with association, but also with other operations.

Example 7: Record lookup. Four scenarios: retrieving records with the Amount greater than 1000; retrieving the sequence number or records with the Amount greater than 1000;return records with primary key value of “v”, return the sequence number for records with primary key value of “v”.

Sequence table:

=data.select(Amount>1000)

= data.pselect(Amount>1000)

= data.find(v)

= data.pfind(v)

Data frame:only the first two scenarios can be achieved, which is done with following code:

newdata<- data [data $ Amount>1000,]

which(data $ Amount >1000)

Data frame hasn’t the concept of major key, so we need to do manual coding for other 2 scenarios as indirect methods, or employ a third party package (i.e. data.table). The codes are omitted here.

Example 8: Group sum. The data is grouped by Client and SellerId. Then the other two fields are aggregated: do a sum for Amount field, and do a count for OrderID field.

Sequence table:

=data.groups(Client,SellerId;sum(Amount),count(OrderID))

Data frame:only support single field aggregation, such as the sum of Amount. As following:

result<aggregate(data[,4],data[c(2,3)],sum)

To do aaggregation of two fields at the same time with data frame, we can only use two separate aggregate statements and then merge the results. Codes are omitted here.

Example 9: Reuse grouping. Group data by Client. Complete multiple subsequent computations on group result. Including: aggregation by amount, and count after grouping by SellerId.

Sequence table:

A2=data.group(Client)

=A2.(~.sum(Amount))

=A2.(~.groups(SellerId;count(OrderID)))

Data frame does not support reuse of grouping directly. Grouping and aggregation usually need to be done in one step. This means we need to do two identical grouping operations to accomplish the same purpose. As following:

result<-aggregate(data[,2],sum)

result<-aggregate(data[,2],data[,3],count)

If we want to reuse grouping, we must use split function and loop to achieve this. The code is both lengthy and with low performance.

Summary:

Sequence tables and data frame are quite different in terms of advanced features. This is mainly demonstrated in the following five ways:

  1. Richness of features. Sequence table has rich functions, and is very convenient to do structured data computation. Data frame originates from matrix, with less support for structured data and lack of many features. Use of the third party packages can in some degree supplement the functions data frame lacks, but these packages are no match for R’s primitive library function in muturity and stability.
  2. Difficulty in syntax. The function names of sequence table are more intuitive.For example, select means to find; pselect is to find the location (position). With data frame the syntax is relatively obscure. For example, “find by field” is data [data $ Amount> 1000,], and retrieve value by field is data[,”Amount”]. These two are confusing and difficult for the programmer to understand. One must have some knowledge on vector to grasp it.
  3. Memory consumption. Basically sequence table function only returns a reference, with very little memory occupation. Data frame must copied record from the original object. If we need to do multiple search, association and grouping operations on large amounts of data, data frame’s memory consumption will be very large. It will impact the whole system.
  4. Code workload and code performance. The functions supported by data frame are not rich enough. We need to do hand-coding to achieve this indirectly. This means more workload. The R interpreter is known to be very slow. With hand-coding the performance is much lower than library functions.
  5. Library function performance. Sequence table has many functions to improve computing performance, such as merging association, grouping functions, binary search, hash lookup. Although data frame supports association, aggregation and search, it’s hard to improve the performance.

Actual case

In this part we use a real case for comprehensive comparison o fdata frame and sequence table.

Computation target: according to daily transactions, selecting stocks from blue-chip stocks whose prices rises in 5 days in a row.

Ideas: Importing data; filtering out previous month’s data; grouped them according to the ticker; sort the data by dates; compute the growth amount for closing price over previous day; compute the number of days for continuous positive growth; filtering out the stocks which rise in 5 or more days in a row.

Sequence Table Solution:

esProc_r_seq_dataframe_2

Data frame Solution:

esProc_r_seq_dataframe_3

Comparison

  1. Data frame function is not rich enough, and is lack of professionalism. We need to use nested loops to meet the requirement in this case. It’s of low computational efficiency. Sequence table has rich and diverse functions. Without the use of loop statement we can achieve the same purpose. The code is shorter and simpler, and the performance is higher.
  2. When programming for data frame, the code is obscure and hard to write. With sequence table, the code is clear and easy to understand. The cost of learning is lower.
  3. When large amount of data is involved in this scenario, the memory consumption will be huge. Sequence table is computationby reference, which consumes less memory. Data frame is computation by value pass. The memory consumption is several times more than sequence table. It easy to result into memory overflow in this scenario.
  4. To import Excel data into data frame, R requires third-party software packages. However they seem to have difficulty working together. Data import needs ten minutes to complete. With sequence table this only needs tens of seconds.

Test Performance

Test 1: Generating 10 million records in memory, each consists of three fields. All values ​​are random numbers. Records are filtered, and each field is summed.

Sequence table:

esProc_r_seq_dataframe_4

Data frame:

esProc_r_seq_dataframe_5

Comparison: sequence table needs 50.534 seconds, while data frame needs 91.999 seconds. The gap is obvious.

Test 2: Retrieving 1.2G txt file. Do filtering and sum on two fields

Sequence Table

esProc_r_seq_dataframe_6

Data frame

esProc_r_seq_dataframe_7

Comparison: sequence table takes 87.122 seconds, while data frame takes 1.1347 hours. The performance difference is tens of times. The reason for this is mainly due to the extremely low speed for file reading.

From the above comparison, we can see that sequence table are better than data frame in terms of rich features, easy syntax, memory consumption, development effort, library function performance and coding performance, etc.. Of course, data frame is not the full strength of R language. R has a powerful vector matrix and the associated mass functions, which make it more professional than esProc in scientific and engineering computation.

Posted in esProc/R/Python/Perl, FAQ | Tagged , , , , | Leave a comment

Comparison of Loop Function in esProc and R Language

Loop function can traverse every member of an array or a set, express complicated loop statements with simple functions, as well as reduce the amount of code and increase readability. Both esProc and R language support the loop function. The following will compare their similarities and differences in usage.

1. Generating data

Generate odd numbers between 1 and 10.

esProc:

x=to(1,10).step(2)

In the code, to(1,10) generates consecutive integers from 1 to 10, step function gets members inconsecutively according to the computed result of last step and the final result is [1,3,4,5,7,9]. This type of data in esProc is called a sequence.

The code has a simpler version: x=10.step(2).

R language:

x<-seq(from=1,to=10,by=2)

This piece of code gets integers directly and inconsecutively from 1 to 10. Computed result is c(1,3,4,5,9). This type of data in R language is called vector.

A simpler version of this piece of code is x<-seq(1,10,2).

Comparison:

  1. Both can solve the problem in this example. esProc needs two steps to solve it, indicating theoretically a poor performance. While R language can resolve it with only one step, displaying a better performance.
  2. The method for esProc to develop code is getting members from a set according to the sequence number. It is a common method. For example, there is a string sequence A1=[“a”, “bc”, “def”……], now get strings in the positions of odd numbers. Here it’s no need to change the type of code writing, the code is x=A1.step(2).

R language generates data directly, thus it has a better performance. It can write common expressions, too. For example, get strings in the positions of odd numbers from the string vector quantity A1=c(“a”, “bc”, “def”……), the expression in R language can be x=A1[seq(1,length(A1),2)].

  1. esProc loop function has characteristics that R language hasn’t, that is, built-in loop variables and operators. “~” represents the loop variable, “#” represents the loop count, “[]” represents relative position and “{}” represents relative interval. By using these variables and operators, esProc can produce common concise expressions. For example, seek square of each member of the set A2=[2,3,4,5,6]:

A2.(~*~)                              / Result is [4,9,16,25,36], which can also be written as A2**A2. But the latter lacks a sense of immediacy and commonality. R language can only use A2*A2 to express the result.

Get the first three members:

A2.select(#<=3)                 / Result is [2,3,4]

Get each member’s previous member and create a new set:

A2.(~[-1])                             / Result is [null,2,3,4,5]

Growth rate:

A2.((~ – ~[-1])/ ~[-1])         /Result is [null, 0.5, 0.33333333333, 0.25, 0.2]

Moving average:

A2.(~{-1,1}.avg())               /Result is [2.5, 3.0, 4.0, 5.0, 5.5]

Summary:In this example, that R language can directly generate data and produce common expressions shows that it is more flexible and takes less memory space than esProc.

2.Filtering records

Computational objects of a loop function can be an array or a set whose members are single value, or two-dimensional structured data objects whose members are records. In fact, loop function is mainly used in processing the latter. For example, select orders of 2010 whose amount is greater than 2,000 from sales, the order records.

Note: sales originates from a text file, some of its data are as follows:

esProc_r_loopfunction_1

esProc:

sales.select(ORDERDATE>=date(“2010-01-01”) && AMOUNT>2000)

Some of the results are:

esProc_r_loopfunction_2

R language:

sales[as.POSIXlt(sales$ORDERDATE)>=as.POSIXlt(“2010-01-01”) & sales$AMOUNT>2000,]

Some of the results are:

esProc_r_loopfunction_3

Comparison:

  1. Both esProc and R language can realize this function. Their difference lies that esProc uses select loop function while R language directly uses index. But there isn’t an essential distinction between them. In addition, R language can further simplify the expression by using attach function:

sales[as.POSIXlt(ORDERDATE)>=as.POSIXlt(“2010-01-01”) & AMOUNT>2000,]

Thus, there are more similarities between them.

  1. Except query, loop function can be used to seek sequence number, sort, rank, seek Top N, group and summarize, etc. For example, seek sequence numbers of records.

sales.pselect@a(ORDERDATE>=date(“2010-01-01”) && AMOUNT>2000)   / esProc

which(as.POSIXlt(sales$ORDERDATE)>=as.POSIXlt(“2010-01-01”) & sales$AMOUNT>2000) #R language

For example, sort records by SELLERID in ascending order and by AMOUNT in descending order.

sales.sort(SELLERID,AMOUNT:-1)                              / esProc

sales[order(sales$SELLERID,-sales$AMOUNT),]    /R language

For example, seek the top three records by AMOUNT.

sales.top(-AMOUNT;3)                                                   / esProc

head(sales[order(-sales$AMOUNT),],n=3)               /R language

  1. Sometimes, R language computes with index, like filtering; sometimes it computes with functions, like seeking sequence numbers of records; sometimes it programs in the form of “data set + function + data set”, like sorting; and other times it works in the way of “function + data set + function”, like seeking Top N. Its programming method seems flexible but is liable to greatly confuse programmers. By comparison, esPoc always adopts object-style method “data set + function + function …” in access. The method has a simple and uniform structure and is easy for programmers to grasp.

Here is an example of performing continuous computations. Filter records and seek Top N. esProc will compute like this:

sales.select(ORDERDATE>=date(“2010-01-01”) && AMOUNT>2000). top(AMOUNT;3)

And R language will compute in this way:

Mid<-sales[as.POSIXlt(sales$ORDERDATE)>=as.POSIXlt(“2010-01-01”) & sales$AMOUNT>2000,]

head(Mid [order(Mid$AMOUNT),],n=3)

As you can see, esProc is better at programming multi-step continuous computations.

Summary:In this example, esPoc gains the upper hand in ensuring syntax consistency and performing continuous computations, and is more beginner-friendly.

3. Grouping and summarizing

The loop function is often employed in grouping and summarizing records. For example, group by CLIENT and SELLERID, and then sum up AMOUNT and seek the maximum value.

esProc:

sales.groups(CLIENT, SELLERID;sum(AMOUNT),max(AMOUNT))

Some of the results are as follows:

esProc_r_loopfunction_4

R language:

result1<-aggregate(sales[,4], sales[c(3,2)],sum)

result2<-aggregate(sales[,4], sales[c(3,2)],max)

result<-cbind(result1,result2[,3])

Some of the results are as follows:

esProc_r_loopfunction_5

Comparison:

  1. In this case, more than one summarizing method is required. esProc can complete the task in one step. R language has to go through two steps to sum up and seek the maximum value, and finally, combine the results with cbind, because its built-in library function cannot directly use multiple summarizing methods simultaneously. Besides, R language will have more memory usage in completing the task.
  2. Another thing is the illogical design in R language. For sales[c(3,2)], the group order in the code is that SELLERID is ahead of CLIENT, but in business, the order is completely opposite. In the result, the order changes again and becomes the same as that in the code. In a word, there is not a unified standard for business logic, the code and the computed result.

Summary:In this example, esProc has the advantages of high efficiency, small memory usage and having a unified standard.

4. Seeking quadratic sum

Use a loop function to seek quadratic sum of the set v=[2,3,4,5].

Please note that both esProc and R language have functions to seek quadratic sum, but a loop function will be used here to perform this task.

esProc:

v.loops(~~+~*~ ; 0)

R language:

Reduce(function(x,y) x+y*y, c(0,v))

Comparison:

  1. Both esProc and R language can realize this function easily.
  2. The use of loops function by esProc means that it sets zero as the initial value, computes every member of v in order and returns the final result. In the code, “~” represents member being computed and “~~” represents computed result of last step. For example, the arithmetic in the first step is 0+2*2 and that in the second step is 4+3*3, and so forth. The final result is 54.

The use of reduce function by R language means that it computes members of [0,2,3,4,5] in order, and puts the computed result of the current step into the next one to go on with the computation. As esProc, the arithmetic in the first step is 0+2*2 and that in the second step is 4+3*3, and so forth.

  1. R language employs lambda expression to perform the operation. This is one of the programming methods of anonymous functions, and can be directly executed without specifying the function name. In this example, function(x,y), the specification, defines two parameters; x+y*y, the body, is responsible for performing the operation; c(0,v) combines 0 and v into [0,2,3,4,5] in which every member will take part in the operation in order. Because it can input a complete function, this programming method becomes quite flexible and is able to perform operations containing complicated functions.

The esProc programming method can be regarded as an implicit lambda expression, which is essentially the same as the explicit expression in R language. But it has a bare expression without function name, specification and variables and its structure is simpler. In this example, “~” represents the built-in loop variable unnecessary to be defined; ~~+~*~ is the expression responsible for performing the operation; v is a fixed parameter in which every member will take part in the operation in order. Being unable to input a function, it is not as good as R language theoretically in flexibility and ability of expression.

  1. Despite being not flexible enough in theory, esProc programming method boasts convenient built-in variables and operators, like ~, ~~, #, [], {}, etc., and gets a more powerful expression in practical use. For example, esProc uses “~~” to directly represent the computed result of last step, while R language needs reduce function and extra variables to do this. esProc can use “#” to directly represent the current loop number while R language is difficult to do this. Also, esProc can use “[]” to represent relative position. For example, ~[1] is used to represent the value of next member and Close[-1] is used to represent value of the field Close in the last record.

In addition, esProc can use “{}” to represent relative interval. For example, {-1,1} represents the three members between the previous and next member. Therefore, the common expression v.(~{-1,1}.avg()) can be used to compute moving average, while R language needs specific functions to do this. For example, there is even no such a function for “seeking average” in the expression filter(v/3, rep(1, 3),sides = 1), which is difficult to understand for beginners.

Summary: In this case, the lambda expression in R language is more powerful in theory but is a little difficult to understand. By comparison, esProc programming method is easier to understand.

5. Inter-rows and –groups operation

Here is a table stock containing daily trade data of multiple stocks. Please compute daily growth rate of closing price of each stock.

Some of the original data are as follows:

esProc_r_loopfunction_6

esProc:

A10=stock.group(Code)

A11=A10.(~.sort(Date))

A12=A11.(~.derive((Close-Close[-1]):INC))

R language:

A10<-split(stock, stock $Code)

for(I in 1:length(A10){

A10[[i]][order(as.numeric(A10[[i]]$Date)),] #sort by Date in each group

A10[[i]]$INC<-with(A10[[i]], Close-c(0,Close[- length (Close)])) #add a column, increased price

}

Comparison:

  1. Both esProc and R language can achieve the task. esProc only uses loop function in computing, achieving high performance and concise code. R language requires writing code manually by using for statement, which brings poor performance and readability.
  2. To complete the task, two layers of loop are required: loop each stock, and then loop each record of the stocks. Except being good at expressing the innermost loop, loop function of R language (including lambda syntax) hasn’t built-in loop variables and is hard to express multi-layer loops. Even if it manages to work out the code, the code is unintelligible.

Loop function of esProc can not only use “~” to represent the loop variable, but also be used in nested loop, therefore, it is expert at expressing multi-layer loops. For example, A10.(~.sort(Date)) in the code is in fact the abbreviation of A10.(~.sort(~.Date)). The first “~” represents the current stock, and the second “~” represents the current record of this stock.

  1. As a typical ordered operation, it is required that the closing price of last day be subtracted from the current price. With the useful built-in variables and operators, such as #,[] and {}, esProc is easy to express this type of ordered operation. For example, Close-Close[-1] can represent the increasing amount. R language can also perform the ordered operation, but its syntax is much too complicated due to the lack of facilities like loop number, relative position, relative interval and so on. For example, the expression of increasing amount is Close-c(0,Close[- length (Close)]).

It is hard enough for loop function in R language to perform the relative simple ordered operation in this example, let alone the more complicated operations. In those cases, multi-layer for loop is usually needed. For example, find out how many days the stock has been rising:

A10<-split(stock, stock $Code)

for(I in 1:length(A10){

A10[[i]][order(as.numeric(A10[[i]]$Date)),] #sort by Date in each group

A10[[i]]$INC<-with(A10[[i]], Close-c(0,Close[- length (Close)])) #add a column, increased price

if(nrow(A10[[i]])>0){  #add a column, continuous increased days

A10 [[i]]$CID[[1]]<-1

for(j in 2:nrow(A3[[i]])){

if(A10 [[i]]$INC[[j]]>0 ){

A10 [[i]]$CID[[j]]<-A10 [[i]]$CID[[j-1]]+1

}else{

A10 [[i]]$CID[[j]]<-0

}

}

}

}

The code in esProc is still concise and easy to understand:

A10=stock.group(Code)

A11=A10.(~.sort(Date))

A12=A11.(~.derive((Close-Close[-1]):INC), if(INC>0,CID=CID[-1]+1, 0):CID))

Summary: In performing multi-layer loops or inter-rows and -groups operations, esProc loop function has higher computational performance and more concise code.

Posted in esProc/R/Python/Perl, FAQ | Tagged , , , | Leave a comment