Functions of esProc/R/Python/Perl in Structured Data Process by Comparison: Chapter2. Import from Text to 2D Table

The data will be imported into 2D table from a format-based text file, and stored in relevant objects provided by the product.

esProc

=file.import@t(Fi:type,…; s,b:e)          

The text file is imported into the table sequence with one statement; Fi specifies the columns to be imported; type defines the data type for each column; s is used to assign the column separator; b and e respectively specify the numbers of bytes in start and end positions; when using @z option (esProc uses “@ + characters” to indicate the function option, which is very convenient), b and e may also be separately used as segment number and total number of segments, in order to split a text file into several segments for parallel process by an easy way; the @t option is set to use the first line as the column name, and generate it automatically.

esProc’s data-importing mechanism is to read data by bytes, and can handle incomplete lines automatically, following the beginning-omitted and end-completed rule. When import from the middle part of the text file, the efficiency is better than that of retrieving by lines.

Perl

  1. The first method is to read the data by lines:

open(INFILE, ‘inputfile.txt’)               #Open the file inputfile.txt

while(<INFILE>)                                    #Read the data line by line

{

    @row=split(“\t” , trim($_));            #Use Tab separator to split the line into the array

    push(@inputFileArray,[@row]);                #Store the array pointer in the array inputFileArray to form a 2D table; where, []represents array reference (pointer); as given in importing the data table, there is no column name, so that it is required to note the column number manually

}

2. The second method is to read the file by bytes, and process the incomplete lines automatically:

 open(INFILE, ‘inputfile.txt’)                        # Open the file inputfile.txt

my($buffer) = “”;

$prefix=””;

while(read(INFILE, $buffer, 1024))           # Read 1024 bytes every time

{

    $tmp=$prefix.trim($_);                      # Add the end line read in last time

    @rows=split(“\n”,$tmp);                  # Use a line break to split a line

    $i=0;

    foreach $value(@rows){

        if($i==@rows-1){                         #Retain the last line in order to add it to the next loop, thus to avoid there is an incomplete line

             $prefix=$value;

             last;

        }       

        @row=split(“\t” , trim($value));      # Use tab separator to split the line into the array

        push(@inputFileArray,[@row]);               

        $i+=1;

    }

}

@row=split(“\t” , trim($prefix));                      # Fill up to the last line

push(@inputFileArray,[@row]);

Python

Both Python and Perl have no existing class and method available for importing regular data of txt file into 2D array, you have to program by yourself to read and parse the data, and finally to store the data as a 2D array. The source code comes here as follows:

1. The first method is to read the data by lines:

#!python

myfile = open(“inputfile.txt”)            # Open the file handle

rs=[]

for line in myfile.readlines():             #Read the data line by line

        row=line.strip().split(‘\t’)          #strip() to delete the end of line\n, split to split the string into array with \t 

        rs.append(row)                           #Add a row array to array rs so that a 2D array is formed

myfile.close()

  1. The second method is to read the data by bytes

#!python

import string 

myfile = open(“d:/T21.txt”,’r’)

BUFSIZE = 1024               

lines = myfile.readlines(BUFSIZE)

value=0 

while lines: 

for line in lines:

           tmp=line.split(‘\t’)

           value+=int(tmp[6])

lines = myfile.readlines(BUFSIZE) 

myfile.close()

It is sure that the results returned by the second method are all full line data, and the program will make up to the last line automatically. In most cases, the bytes of returned data will be slightly greater than that value specified by BUFSIZE (except the case when the function readlines(BUFSIZE) is called last time). In normal conditions, Python will adjust the user-specified BUFSIZE value to integer multiples of the size of the internal cache.

R

As easy as esProc, R can load data into the data frame object directly.

data<-read.table(“d:/T21.txt”)       # Generate the data frame directly

You can specify the inter-column separator, row name, column name and other parameters, see below for details:

read.table(file, header = FALSE, sep = “”, quote = “\”‘”,dec = “.”, row.names, col.names,

as.is = !stringsAsFactors,na.strings = “NA”, colClasses = NA, nrows = -1,skip = 0,

check.names = TRUE, fill = !blank.lines.skip,strip.white = FALSE, blank.lines.skip = TRUE,

comment.char = “#”,allowEscapes = FALSE, flush = FALSE,

stringsAsFactors = default.stringsAsFactors(),fileEncoding = “”, encoding = “unknown”)

file: To specify the file to be read or the path of file;

header: Whether to retrieve the column name? The default is not to do;

sep: To specify the separator: data to be imported is required to include regular separators, which may be space, TAB, line break, Enter;

as.is: Whether the character read is converted into the factor? The default is Yes;

colClasses: To specify data type format for each column

header=TRUE: Whether the first line is column name? The default is TRUE

stringsAsFactors=: Whether the character is converted into the factor? The default is true

row.names=c(): To specify a name for each row

col.names=c(): To specify a name for each column; if no header is found when reading the file, it can be specified

skip=N: From which row of the file can the data be read

nrows=N: Maximum number of rows read

na.strings=c(): What kind of character may be specified to indicate that the value is missing

comment.char=‘’: To specify start character for the comment, the default is #

dec=: To specify the precision of number

encoding=: To specify the non-ASCII code rule

R also supports to import such file whose data has no separator but fixed column width, for example:

data<-read.fwf(“d:/T21.txt”,widths=c(2,3))   # It indicates the first column width is 2 characters, and the second column is 3        

Likewise, read.fwf also has a variety of parameters as follows:

read.fwf(file, widths, header = FALSE,sep=”\t”, skip = 0, row.names, col.names, n = -1,

buffersize = 2000)

file:To specify the file to be read, or the path of the file;

widths: To specify the segment width, which could be a vector to specify different widths;

buffersize: Buffer size, whose unit is the number of rows;

n:To read the number of data rows, the default is full read;

Conclusions:

2014-06-10_110523

Advertisements

About datathinker

a technical consultant on Database performance optimization, Database storage expansion, Off-database computation. personal blog at: datakeywrod, website: raqsoft
This entry was posted in Structured Data Process and tagged , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s