Functions of esProc/R/Python/Perl in Structured Data Process by Comparison :Chapter 14.Group


esProc provides a variety of grouping functions, by which, the grouping and aggregation processes can be performed by two steps or by one step. In the normal conditions, the grouping can be done by the column, expression, or by the rule of enumeration. For example:,col2)        //Group by col1, col2, not to aggregate, and return the grouped set

=tbl.groups(col1,col2;sum(col3),count(1))      //Aggregate while grouping, and return the aggregated result                       //Return distinct values of Column class, it is similar to the distinct of sql

=tbl.enum([“Class 1″,”Class 2″,”Class 3”],class)   //Column class is grouped by [“Class 1″,”Class 2″,”Class 3”] order, and return the grouped set


@carts is a 2D array. Now it is necessary to group it by Column 2 of 2D array, and then aggregate the values of Column 3 after grouped, the code is given as below:


    @carts=([173744,”aaa”,14],           #Define a 2D array



    %groups=();                        #Define an associative array used to store grouped results

    foreach(@carts){              #The arrays starts to be cycled , and add the records to grouped results row-by-row

        $name = $_->[1];

    if($groups{$name} == null){     #Determine if the current group exists

    $groups{$name}=[$_];     # If not, create a new group, and remember that only [] can be used to indicate the reference here



           push($groups{$name},$_);      # If the group exists, add the record to it directly



    my @result=();                            #Define a 2D array used to store the groups’ summary results

    foreach( keys(%groups)){         # A summary is done for each subgroup after grouped


    while($row=pop $groups{$_}){        #Here the direct foreach operation is disabled for $groups{$_}, since it is an array pointer, as an individual value other than an array per se. Therefore only pop can be used, it is likely to mislead the programmer in a wrong way. 

         $value += $row->[2];                 # Here the symbol -> cannot be omitted


    push @result,[$_,$value];


The above code can also be written as:

    @carts=( ….);

    my %groups;

    push @{$groups{$_->[1]}, $_ foreach @carts;


    my @result;

    foreach my $gp ( keys(%groups) ){

    my $value=0;

    $value += $_->[2] foreach @{$groups{$gp}};

    push @result, [ $gp, $value ];


The above two methods of code writing are essentially the same in summary part. In the second method, the inner loop is only merged into a single line and the code becomes shorter. Their key difference lies in grouping. In the first method, $_->[1] is singled out to assign value to $name. Because $groups{$name} is used many times in the subsequent code, the code will become cumbersome if $groups{$_->[1]} is usedeach time. Besides, the first method makes a judgementonnull. When $groups{$name} == null, error report will appear if push($groups{$name},$_) is executed directly. This judgement is omitted in the second method, so it is unnecessary to single out $_->[1] to assign value. Thus the whole loop can be merged into a single line to make code shorter. In practice, the null pointer is sure to cause an exception.

So far, there is no available module in Perl that can be used to perform the group and aggregate operation on the 2D array. Programmer has to encapsulate it by himself. As we can see from the above example, the grouping operation will take more works with relatively long code, but the syntax of involved pointer will make the programmer error prone.


Groupby function is provided in itertools package of python. The records can be divided into grouped sets by fields, and then you write a loop program to aggregate. This method saves the amount of codes of grouping. However, as the group-summary process is split into two steps, and if grouped result need not to be reused, this method runs slowly instead.

And compared with esProc, groupby function shows more cumbersome. In many cases, you still have to write the code for grouping the array. Two grouping methods are given as below: one is to group by your own codes, the other is to use groupby function.

Method 1:

  from functools import reduce

  from operator import add


  data = [                               #A 2D array

  [121.1, 0.02, 0.02],

   [121.1, 0.05, 0.05],

   [122.1, 0.56, 0.6],

   [122.1, 3.79, 4.04],

   [123.1, 93.75, 100.0],

   [123.1, 0.01, 0.01],

   [124.1, 0.01, 0.01],

   [124.1, 1.01, 1.08],

   [124.1, 0.11, 0.11],

   [124.1, 0.05, 0.06],

   [125.1, 0.39, 0.41],


  def reduce_callback(d, i):         #Define a grouping function with the returned result of dictionary, which is similar to map

  key = i[0]  


  d[key] = list(map(add, d[key], i[1:]))         #Directly make a summary, and the grouping and aggregation processes are completed by one step

  except KeyError:

  d[key] = i[1:]

  return d    


  result = reduce(reduce_callback, data,{})        #Execute the grouping function, and return the dictionary

  result_list = [[k]+v for k,v in result.items()]     #The dictionary is converted back to the 2D array


Method 2:

  from itertools import groupby

  from operator import itemgetter

  data = [                               #A 2D array

  [121.1, 0.02, 0.02],

   [121.1, 0.05, 0.05],

   [122.1, 0.56, 0.6],

   [122.1, 3.79, 4.04],

   [123.1, 93.75, 100.0],

   [123.1, 0.01, 0.01],

   [124.1, 0.01, 0.01],

   [124.1, 1.01, 1.08],

   [124.1, 0.11, 0.11],

   [124.1, 0.05, 0.06],

   [125.1, 0.39, 0.41],



  for key, items in groupby(data, itemgetter(0)):       #Use groupby function to group directly



  forsubitem in items:                                     #Perform aggregation cyclically



  result.append([key,value1,value2]) #Finally, the aggregated results are appended to the 2D array




group<-aggregate(data[,1],list(substr(data[,4],1,2)),sum)    #Group by substr(data[,4],1,2) in order to aggregate

#Evaluate the sum of the data[,1]field

For the summary function sum here, only one method is specified here once. If you want to simultaneously evaluate the sum and mean of fields, no matter how to write, the error will occur. No helpful way by searching everywhere, but I believe that R can provide a support for it. This only remains to be surveyed.



About datathinker

a technical consultant on Database performance optimization, Database storage expansion, Off-database computation. personal blog at: datakeywrod, website: raqsoft
This entry was posted in esProc/R/Python/Perl, Structured Data Process, Uncategorized and tagged , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s