Methods of Accessing Excel Files with the R Language

There are many ways for R to access Excel files, but each has its weaknesses. The xlsx package, for example, requires fairly involved code and supports only the Excel 2007 format; RODBC has too many restrictions, is hard to understand, and fails unpredictably. Saving the file in csv format first is relatively common and stable, but it is inconvenient and cannot be automated across many files. Extracting the underlying xml is another option, but its complicated steps and code make it impractical. Transferring the data through the clipboard is not ideal either, because part of the work must be done manually; saving to csv is preferable in that case.
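For reference, the csv workaround usually looks like the minimal sketch below. It assumes every workbook in the ordersData directory has already been exported manually to a .csv file with the same column layout; the manual export step is exactly what makes this approach hard to script.

    # a minimal sketch of the csv workaround, assuming each workbook has
    # already been saved manually as a .csv file with the same columns
    setwd("E:/ordersData")
    csvFiles <- dir(pattern = "\\.csv$")      # pick up only the exported csv files
    orders <- do.call(rbind, lapply(csvFiles, read.csv, stringsAsFactors = FALSE))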

All these problems can be avoided, however, by reading Excel files with the gdata package and writing them with WriteXLS. Both packages support Excel 2003 and Excel 2007, run stably, use simple and intuitive code, and require no manual work. The following example illustrates how to access Excel files with these two packages.

Target:

There are multiple Excel files with the same structure in the directory ordersData. They contain sales orders from several years; some are in Excel 2007 format and others in Excel 2003 format. Load them all, compute the total sales amount for each client, and write the result to result.xlsx. Below is some of the data from 2011.xlsx:

[Screenshot: sample rows from 2011.xlsx]

Code:

    library(gdata)
    library(WriteXLS)
    setwd("E:/ordersData")
    fileList <- dir()
    orders <- read.xls(fileList[1])
    for (file in fileList[2:length(fileList)]) {
        orders <- rbind(orders, read.xls(file))
    }
    result <- aggregate(orders[,4], orders[c(2)], sum)
    WriteXLS("result", "result.xlsx")

Some of the data in result.xlsx are as follows:

[Screenshot: sample rows from result.xlsx]

Code interpretation:

  • The lines library(gdata) and library(WriteXLS) load the two third-party packages, which provide the read.xls and WriteXLS functions for reading and writing Excel files respectively.
  • The line fileList<-dir() lists all the files in the directory. The for statement that follows reads the remaining files in a loop and merges their data into the data frame orders. If the directory contains other kinds of files, they should be filtered out with a filename pattern (see the sketch after this list).
  • The line result<-aggregate(orders[,4], orders[c(2)],sum) performs the grouping and summarizing: orders[,4] is the column to summarize (Amount) and orders[c(2)] is the grouping column (Client).
  • Although read.xls and WriteXLS come from different packages, both work on data.frame objects, so they cooperate well. In addition, read.xls automatically recognizes both the Excel 2003 and Excel 2007 formats, which is quite convenient.
  • The code is concise and easy for beginners to grasp.
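The variant below shows the same steps with the two small refinements hinted at above: the file list is restricted to Excel files by a filename pattern, and the grouping refers to columns by name rather than by position. It assumes the column headers in the source files are literally Client and Amount, which the original code only references by index.

    # a sketch of the same workflow, filtering files by extension and
    # grouping by column name; the headers "Client" and "Amount" are assumed
    fileList <- dir(pattern = "\\.xlsx?$")                  # keep only .xls / .xlsx files
    orders <- do.call(rbind, lapply(fileList, read.xls))    # read and merge in one step
    result <- aggregate(orders["Amount"], orders["Client"], sum)
    WriteXLS("result", "result.xlsx")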

Notes for use:

  1. Versions

gdata and WriteXLS are not built into R; they are third-party packages that must be downloaded and installed. Moreover, both require a Perl environment, so choosing compatible versions is particularly important. In our trials, R 2.15.0 works well with gdata 2.13.3 and WriteXLS 3.5.0. Things may go wrong with the newest Perl, however, so the older Perl 5.14.2 is required; otherwise the following error appears:

    Error in xls2sep(xls, sheet, verbose = verbose, ..., method = method,  :
      Intermediate file 'C:\Users\Thim\AppData\Local\Temp\RtmpMHvLZS\file224060624738.csv' missing!
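If several Perl builds are installed side by side, one workaround is to point read.xls at a specific interpreter through its perl argument. This is only a sketch; the path below is illustrative and must be replaced with the location of the Perl 5.14.2 binary on your own machine.

    # point read.xls at a particular Perl build; the path is an example only
    orders <- read.xls("2011.xlsx", perl = "C:/Perl64/bin/perl.exe")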

  2. Performance

gdata and WriteXLS have no problem with small files, but they perform badly on bigger ones (probably because of Perl). For example, it takes 8 to 10 minutes to read an Excel file with 8 columns and 200,000 rows. For better performance we recommend the xlsx package (a sketch follows below), although Excel 2003 files are then out of the picture. Even so, xlsx performs only slightly better than gdata. To truly improve performance, it is better to convert all Excel files to the 2007 format, uncompress the xml files inside them, and read the data by parsing that xml directly.
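As a rough illustration of the xlsx-package route, the sketch below uses read.xlsx2, the faster of the package's two readers, and assumes all workbooks have already been converted to .xlsx (2007) format. Note that read.xlsx2 typically returns character columns, so the Amount column (a name assumed from the example above) may need an explicit conversion before aggregating.

    # a sketch of the xlsx-package route, usable only for .xlsx (2007-format) files
    library(xlsx)
    fileList <- dir("E:/ordersData", pattern = "\\.xlsx$", full.names = TRUE)
    orders <- do.call(rbind, lapply(fileList, read.xlsx2, sheetIndex = 1))
    orders$Amount <- as.numeric(orders$Amount)   # columns usually come back as character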

Alternative methods:

To get around R's version conflicts and performance problems, there are alternatives such as Python, esProc, and Perl. Like R, they can access Excel files and perform the data computation. Below we briefly introduce esProc and Python.

esProc has Excel access built into its installation package, so no extra third-party packages need to be downloaded. It can read Excel 2003, Excel 2007, Excel 2010 and even older versions. Its code is as follows:

[Screenshot: esProc code for this task]

esProc's performance is satisfactory: it takes only 20 to 30 seconds to read an Excel file with 8 columns and 200,000 rows.

Python's performance is quite good, although, like R, it needs third-party packages. Pandas should have been able to handle the xls files easily, but we failed to install it under Windows (and xls files are, after all, mostly produced under Windows). In the end we completed the task with the xlrd and xlwt3 packages together. Unfortunately, these two packages support only Excel 2003 and require much more complicated code:

    import xlwt3
    import xlrd
    from itertools import groupby
    from operator import itemgetter
    import os

    dir = "E:/ordersData/"
    fileList = os.listdir(dir)
    rowList = []
    for f in fileList:
        book = xlrd.open_workbook(dir + f)           # open each workbook read-only in the loop
        sheet = book.sheet_by_index(0)
        nrows = sheet.nrows
        ncols = sheet.ncols
        for i in range(1, nrows):
            row_data = sheet.row_values(i)
            rowList.append(row_data)                 # append every record to rowList
    rowList = sorted(rowList, key=lambda x: (x[1]))  # sort the data before grouping
    result = []
    for key, items in groupby(rowList, itemgetter(1)):   # group with the groupby function
        value1 = 0
        for subItem in items:
            value1 += subItem[3]
        result.append([key, value1])                 # collect the summarized results in a 2D list
    wBook = xlwt3.Workbook()                         # create a new writable workbook
    wSheet = wBook.add_sheet("sheet 1")
    wSheet.write(0, 0, "Client")
    wSheet.write(0, 1, "Sum")
    for row in range(len(result)):                   # write data to the file row by row
        wSheet.write(row + 1, 0, result[row][0])
        wSheet.write(row + 1, 1, result[row][1])
    wBook.save(dir + "result.xls")                   # save the file

This is a far more complicated approach than the R version.
