Functions of esProc/R/Python/Perl in Structured Data Processing by Comparison: Chapter 19. Access to Large Files

esProc

esProc provides the concept of a file cursor, which supports a series of operations on large files, such as sorting, filtering, grouping, joining, and aggregation. With the appropriate option, the operation can run entirely in memory, or spill data that memory cannot hold to a file, thereby avoiding out-of-memory failures.

  =file("d:/T21.txt").cursor@p(#1:long)

  =A1.groups(;sum(#1))

The above code runs against a 1 GB file and takes 26 seconds.

Perl

  1. To modify the large file

Perl's Tie::File module is well suited to modifying large files.

This module represents a file as a Perl array, where each line corresponds to one array element: line 1 is element 0, line 2 is element 1, and so on. Any change made to the array is synchronized to the file.

The file is not loaded into memory up front; instead, only as much as is needed is read each time. For example, to print line 13, the system reads only the first 13 lines into memory and then prints line 13, rather than reading the entire file.

To count the number of lines in the file, the system traverses the entire file line by line without ever reading it all into memory, which saves a considerable amount of memory.

The greatest benefit of this module is that you can freely address specific lines at the beginning or the end of the file.
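This lazy, line-at-a-time behavior can be sketched in Python for comparison (the helper names and file path here are illustrative, not part of Tie::File):

```python
from itertools import islice

def nth_line(path, n):
    """Return line n (0-based), reading only the first n + 1 lines."""
    with open(path) as f:
        return next(islice(f, n, n + 1), None)

def count_lines(path):
    """Count lines by streaming the file, never holding it all in memory."""
    with open(path) as f:
        return sum(1 for _ in f)
```

As with Tie::File, fetching an early line reads only a prefix of the file, and counting lines is a single streaming pass.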

Some problems are as follows:

  1. If you want to print a line near the end of the file, even the very last one, the system has to read most or all of the file into memory, which takes up a great deal of memory.
  2. When the file is too large, binding the array to the file fails, so all the subsequent operations are skipped. Perhaps the specific reason is that the array length is limited: the file has too many lines, and such a long array cannot be supported?
  3. Computing a column sum over the file requires traversing every record, and this method is very slow. In my test, summing one column of a 1 GB file took more than 1,800 seconds. Therefore, this module is only suitable for modifying part of the data in a file, not for loading a file to compute a summary.

The sample code is given as follows:

  use Tie::File;

  tie @array, 'Tie::File', "c:/a.txt";   # Bind the array @array to the txt file

  $array[13] = 'blah';                   # Change the content of line 14 in the file to 'blah'

  print $array[42];                      # Print the content of line 43 in the file

  $n_recs = @array;                      # Get the total number of lines in the file

  $#array -= 2;                          # Delete the two lines at the end of the file

  2. To load the large file

Perl's other practice for files is suitable for loading the file as a stream in order to compute a summary. The sample code is given below:

  open(FILE_IN, "d:/T21.txt");

  $value = 0;

  while (defined($perIns = <FILE_IN>))

  {

            @row = split("\t", $perIns);

            $value += int($row[0]);

  }

  close(FILE_IN);

The above code runs against the 1 GB file and takes 50 seconds.

Python

When Python loads a file, you can specify a buffer size so that a bounded amount of data is read at a time, and the last line in each batch is read to completion, so no half lines occur. For a large file, this avoids memory overflow. The sample code is given below:

  myfile = open("d:/T21.txt", "r")

  BUFSIZE = 1024

  lines = myfile.readlines(BUFSIZE)

  value = 0

  while lines:

         for line in lines:

                   tmp = line.split('\t')

                   value += int(tmp[6])

         lines = myfile.readlines(BUFSIZE)

  myfile.close()

Using the above method, loading a 1 GB file and summing one field takes 44 seconds.
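For reference, iterating a Python file object directly also streams line by line with internal buffering, so the manual readlines(BUFSIZE) loop can be written more compactly. A minimal sketch, assuming a tab-separated file (the function name and column index are illustrative):

```python
def sum_column(path, col=0):
    """Stream a tab-separated file and sum one integer column.

    Iterating the file object reads one buffered line at a time,
    so memory use stays constant regardless of file size.
    """
    total = 0
    with open(path) as f:
        for line in f:
            total += int(line.split("\t")[col])
    return total
```

The `with` block also guarantees the file is closed even if a malformed line raises an exception.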

R

Loading the 1 GB file in a loop and summing one column takes 9 minutes and 47 seconds.

  con <- file("d:/T21.txt", "r")

  lines = readLines(con, n=1024)

  value = 0

  while (length(lines) != 0) {

         for (line in lines) {

                   data <- strsplit(line, '\t')

                   value = value + as.numeric(data[[1]][1])

         }

         lines = readLines(con, n=1024)

  }

  print(value)

  close(con)

For summarizing large amounts of data, R runs significantly slower than the other languages. For the same 1 GB file and the same summarizing logic, esProc takes 26 seconds, Perl 50 seconds, and Python 44 seconds, but R takes 9 minutes and 47 seconds, slower by more than an order of magnitude. Loading the data as a data frame is slower still: it takes over 11 minutes to do the job.
