Functions of esProc/R/Python/Perl in Structured Data Process by Comparison :Chapter8. 2D Table Summary


  =[1,2,3].sum()                                       // Evaluate the sum for the Sequences

  =tbl.sum(col1)                                      // Evaluate the sum for col1 fields in Table sequence tbl>30).sum(col1)        // Evaluate the sum for the record field col1 for which the col2>30 is true in Table sequence tbl

  =tbl.sum(#3)                                         // Evaluate the sum for Column 3 in Table sequence tbl

For other max, min, count and so on, the aggregation syntaxes are all similar. As we can see, it is far more convenient to make aggregations whether you will do this action by Column Name or by Column Number since there is a concept for the column in the Table sequence.


Summary on a 1D array is quite simple, for example:

  use List::Utilqw/sum max min maxstrminstr/;
  my @array = (10, 20, 30, 40);
  my $sum = sum @array;                  # Get 100

  my $max = max @array;                  # Get 40
  my $min = min @array;                    # Get 10

  my @array = (“Beijing”, “Shanghai”, “Guangzhou”, “Chengdu”, “Nanjing”);
  my $maxstr = maxstr @array;       # Get Shanghai
  my $minstr = minstr @array;         # Get Beijing 

For 2D array, you will end up in trouble. Since there is no concept for the column in 2D array of Perl, in order to summarize on a column, you need to slice by the columns, and then start to do a summary, for example:

  my @array = ([10,11], [20,21], [30,31], [40,41]);

  @array1=map{$array[$_][1]}0..$#array;        #First slice by the columns

  my $sum = sum @$array1;                                # Get 104

The above syntax can be merged into one statement, that is:

  my $sum = sum map{$array[$_][1]}0..$#array;              # Get 104

Although it can be solved by one statement with the map syntax, it is a little too complex to understand and less intuitive and easy to read than for loop. You can also simplify the map syntax as below:

  my $sum = sum map{$_->[1]}@array;            # Get 104

However, here -> cannot be omitted, because $_ is a pointer. 

And if you want to aggregate the records according to specific criterion, e.g., the sum of second field of the record which matches with specific criterion will be evaluated, the syntax is as follows:

  my @array = ([10,11], [20,21], [30,31], [40,41]);

  @array1=map{                 #First, find the elements of Column 2 in the record whose value in Column 1 is greater than and equal to 30





  my $sum = sum @array1;        # Then make a summary

This syntax is not too much trouble, but for the beginner who has not well understood the knowledge in terms of map, it is somewhat long and unintelligible, even less intuitive than for.


Python can achieve a column-based slice on a 2D array with zip function in order to facilitate the column-based summary, which runs the source codes simpler than that of Perl, for example:



    value=sum(a)            #Make a summary on a 1D array


    b,c=zip(*a)                 #Column-based slice, b=[1,3,5,7], c=[2,4,6,8]

    value=max(b)            #Make a summary right after slice


As can be seen from the above source codes, although the use of zip function facilitates the column-based aggregation, this function can only be used to perform the operation on the whole columns other than one or two columns, so that the efficiency of zip function will be lowered when a 2D array is much wider. Therefore, it is suggested that you had better be honest to write a loop statement for making a summary if the 2D array is wider and the columns to be counted are relatively lesser, the reference code is shown as below:



    for i in range(0,len(a)):



In terms of 2D table summary, R is more powerful than other languages. In addition to more extensive summary functions, it also supports to make a summary by the rows, for example:

  rowSums(dd2)                    # Evaluate the sum by the rows

  colSums(dd2)                      #Evaluate the sum by the columns

  rowMeans(dd2)                 #Evaluate the average value by the rows

  colMeans(dd2)                   #Evaluate the average value by the columns

When making the summary by the rows, the premise is that the data types of all columns are all numeric values.

R language features the summary by rows, but for other languages, it can be achieved only by writing its own loop statement.



About datathinker

a technical consultant on Database performance optimization, Database storage expansion, Off-database computation. personal blog at: datakeywrod, website: raqsoft
This entry was posted in Structured Data Process and tagged , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s