Performing Group Operations on Text-based Tabular Data in JAVA

The group operations performed on tabular data generated from text files include algorithms like grouping and aggregation, obtaining distinct values, group merging and so on, which can be realized using basic JAVA class libraries. But JAVA provides only limited support for the structured-data computing, generating complicated and inefficient code for knotty cases.

Yet it’s easy to perform group operations on tables imported from text files by using esProc as the JAVA class library for handling structured data. The ways esProc works include independent operation and receiving invocation from either the command line or a JAVA application. More information can be found in the article How esProc Implements Text Processing.

Here follows some typical scenarios about group operations operated on text-based tabular data and offers their solutions in esProc.

Simple grouping and aggregation

The sales.csv file stores sales orders. Requirement: Group data by SellerId to calculate the total sales amount per seller and write the results to a new file. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

esProc code:

A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.groups(SellerId;sum(Amount))
3 =file(“e:\\result.txt”).export@t(A2)

Explanation: First import the source file as a structured two-dimensional table; then perform the group and aggregate.

By default The import function imports all fields using tab as the separator; @t means importing the first row as the field names. The groups function can perform group and aggregate on a two-dimensional table. The export function writes the two-dimensional table to a file. To write the tabular data to another data table having the same structure, use the code OracleDB.update(A2,Tsales).

Here’s the result:

esProc_at03a_1

Extended information:

  1. If the source data has been ordered by SellerId, we could use @o to speed up the data handling by only comparing the adjacent records with the code =A1.groups@o(SellerId;sum(Amount)).
  2. esProc allows for grouping data according to multiple fields and the implementation of various aggregate algorithms. To calculate the sales amount and count the orders for each seller per year, for example, we could use this code:groups(SellerId,year(OrderDate);sum(Amount),count(OrderID)).

Here’s the result:

esProc_at03a_2

  1. When the source file is too big to be entirely loaded into the memory but the aggregate result would be smaller (this is what usually happens), we can import the file as the cursor and perform the aggregate with the group function, thus the code will be file(“D:\\sales.txt”).cursor@t().groups(SellerId;sum(Amount)).
  2. If the aggregate result set isn’t small enough to be completely put into the memory, we could carry out the aggregation using groupx Unlike the group function, groupx returns a cursor.

Duplicate-related operations

This scenario is about removing the duplicate rows in the Sales.csv file, that is, for two rows with the same OrderID, only keep the first one. Below is a selection of the source file:

OrderID   Client        SellerId     Amount    OrderDate

1       WVF Vip 5       440.0        2009-02-05

1       WVF Vip 5       1040.0      2009-02-04

1       UFS Com          5       1040.0      2009-02-03

2       UFS Com  13     1863.4      2009-07-05

2       UFS Com  13     1863.4      2009-07-05

3       SWFR        2       1813.0      2009-07-08

4       JFS Pep     27     670.8        2009-07-08

5     DSG       15    3730.0    2009-07-09……

esProc code:

A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.group@1(OrderID)

Explanation: Group the imported data by OrderID and get the first record of each group without aggregation. The group function is able to group data without performing the aggregate; @1 means getting the first row of each group and returning the results as a new two-dimensional table.

Here’s the result:

esProc_at03a_3

Extended information:

  1. It’s a more practical practice to retain records according to a certain condition. To keep the record with the earliest date in each group, use the following code: group(OrderID).(~.maxp(-OrderDate)).

Here’s the result:

esProc_at03a_4

To find the first of each group of records sorted by OrderDate in ascending order and by Amount in descending order, we can find the records with earliest OrderDate and then find the one from them with the greatest amount. The code is A1.group(OrderID).(~.maxp([-OrderDate,Amount])).

If there are multiple records satisfying the condition and we want them all, use A1.group(OrderID).conj(~.maxp@a([-OrderDate,Amount])) .

To get the top two from the multiple eligible records, use

A1.group(OrderID).conj(~.top(2,OrderDate,-Amount)).

  1. Sometimes it is the duplicate records that we want, then get them with group(OrderID).select(~.len()>1).conj().

Here’s the result:

esProc_at03a_5

  1. Sometimes we simply want the distinct values of a certain field. To get the client list, for instance, use id(Client). The result:
  2. esProc_at03a_6

The id function performs the distinct operation, so it is the equivalent of A1.group@1(Client).(Client).

  1. Other times we want all the rows that have different field values, instead of filtering the records by ID. In those cases we do it with

A1.group@1(OrderID,Client,SellerId,Amount,OrderDate)

The result:

esProc_at03a_7

Without intending further structured-data computation, we can import the file as a set of rows and perform distinct operation using file(“D:sales.csv”).import@tsi().id(). This is what we get:

esProc_at03a_8

Concatenating strings by groups

The sales.csv file stores sales orders, based on which we want a table of client lists having three fields – year, month and comma-separated client list. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

esProc code:

A
1 =file(“D:\\sales.cvs”).import@t()
2 =A1.group(year(OrderDate),month(OrderDate);~.(Client).string())

Explanation: Group the imported data by the two parts of OrderDate, get the set of clients in each group and convert them to comma-separated strings. Here’s the result:

esProc_at03a_9

Intra-group alignment

The empContact.csv file has six columns, the first four of which contain employee information where duplicate values exist; the fifth column holds the ways of contact and the last column is the specific contact information. Each employee has at least one way for contact. Now we want to merge the records to make a new table where the fifth, sixth and seventh columns are work phone, cell phone and work email. Below is source file:

21458952,John,technology,support staff,work phone,555-555-5555

21458952,John,technology,support staff,work email,johndoe@whatever.net

21458952,John,technology,support staff,cell phone,233-1132-9762

99946133,Jane,technology,administration,work phone,444-444-4444

99946133,Jane,technology,administration,work email,janepaul@whatever.net

99946133,Jane,technology,administration,cell phone,121-3433-9788

99946133,Jane,technology,administration,home phone,431-443-8434

esProc code:

A
1 =file(“D:\\empContact.csv”).import(;”,”)
2 =A1.group(#1,#2,#3,#4;~.align([“work phone”,”cell phone”,”work email”],#5):g)
3 =A2.new(#1,#2,#3,#4,g(1).#6,g(2).#6,g(3).#6)

Explanation: Group the imported records by the first four columns and align records in each group according to the order of work phone\cell phone\work email (the duplicate values will be automatically filtered away); then build a new two-dimensional table by sequentially retrieving the records. Here’s the result:
esProc_at03a_10

Grouping by sequence numbers

In the following file, every three lines correspond to one record. For example, the first record is 26\TAS\2009-08-05. We want to rearrange it into a two-dimensional table. Here’s the source data:

26

TAS

2009-08-05

33

DSGC

2009-08-14

84

GC

2009-10-16

esProc code:

A
1 =file(“D:\\data.txt”).import@si()
2 =A1.group((#-1)\3)
3 =A2.new(~(1):OrderId,~(2):Client,~(3):OrderDate)

Explanation: Import the file as a sequence, during which @s means that fields won’t be extracted; group the sequence every three lines, during which “#” represents the sequence number of each line and “\” represents integer division; finally create a table sequence based on the groups, during which ~(1) is the first member of the current group.

The result:

esProc_at03a_11

Extended information: If the file has a complicated format. For example, there are three fields in the second line of each group:

26

TAS   1       2142.4

2009-08-05

33

DSGC        1       613.2

2009-08-14

84

GC    1       88.5

2009-10-16

In this case, we just need to modify A3 into this : =A2.new(~(1):OrderID, (line=~(2).array(“\t”))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate )

The result:

esProc_at03a_12

Intra-group order-related calculation

postTurn.txt records the schedule of rotating shifts for a post. The file is grouped by dates and some persons work consecutive shifts. We want to find out how many times the two neighboring names work successive shifts. The fields would be first, next and times. Below is the source file:

#2015-06-01

Ashley

Matthew

#2015-06-02

Ashley

Matthew

Matthew

#2015-06-03

Ashley

Ashley

Matthew

Ryan

#2015-06-04

Ashley

Ryan

Ryan

Matthew

#2015-06-05

Jessica

#2015-06-06

Ashley

Jessica

Matthew

Ashley

Jessica

Jessica

Ryan

esProc code:

A
1 =file(“D:\\postTurn.txt.txt”).import@i()
2 =A1.group@i(pos(~,”#”))
3 =A2.conj(~.([~[-1],~]).to(3,))
4 =A3.group(~).new(~(1)(1):first,~(1)(2):next,count(~):times)

Explanation: Group the imported data according to the condition that whether a row contains “#” or not, and each group is a day’s work shift schedule; for each group, put every two neighboring names into a subset and concatenate all the subsets; group each concatenated set every two members, get the first name and the next name from each group and count how many times they appear together. Here’s the result:

esProc_at03a_13

Inserting data between groups

In the sales.txt file, the value of SellerId changes every multiple rows. We want to export the OrderId field and add the string “Begin” every time a group of records with new value begins and the string “End” after a group ends. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

esProc code:

A
1 =file(“D:\\sales.csv”).import@t()
2 =A1.group@o(SellerId).conj(“Begin”|~.(OrderID)|”End”)

Explanation: Write the code directly according to what it asks. @o enables grouping by comparing the adjacent records. To retrieve a certain field from a group of records, use ~.(OrderID). The operator “|” is used to concatenate sets or members of a set. The conj function handles subsets respectively and then concatenates the results.

Here’s the result:

esProc_at03a_14

Extended information:

  1. Stop using @o when data is unordered, but the performance will decrease a little.
  2. To add an aggregate value, the number of records in each group, for example, write A2 as group@o(SellerId).conj((“Begin: “+string(~.count()))|~.(OrderID)|”End”)

Here’s the result:

esProc_at03a_15

  1. To add strings to both ends of a group of records and to append them to a new file in order, use the following code:
A B
1 =file(“D:\\sales.csv”).import@t() =file(“d:\\result.txt”)
2 for A1.group@o(SellerId) =B2.export@a([“begin”])
3 =B2.export@a(A2)
4 =B2.export@a([“End”])

The result:

begin

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

End

begin

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

End

begin

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

End

begin

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

End

Sampling by groups

The sales.txt file stores ordering data, and we want to retrieve one record from each seller’s records. Here’s the rule: Find out records whose OrderDates are fewer than 60 days after their previous ones, and retrieve one randomly from these sampled records; if there is an insufficient number of records satisfying the condition, just get the record that has the smallest OrderDate. Besides, it is specified that the second record will always appear in each group. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

144  WZ   6       86.4 2010-12-23

70     DSG 7       288.0        2009-09-30

131  FOL  7       103.2        2009-12-10

65     YZ     8       29600.0   2009-01-06

esProc code:

A B C
1 =file(“D:\\sales.csv”).import@t()
2 for A1.group(SellerId) =A2.sort(OrderDate).select(interval(OrderDate[-1],OrderDate)<60)
3 if B2!=[] =B1=B1|B2(rand(B2.len())+1)
4 else =B1=B1|B2(1)

Explanation:

Group the imported records by SellerId and loop through the groups. For each loop, first sort the group by OrderDate and then take the sample as required. If there’re enough records sampled, retrieve one randomly and append it to B1; if the number of records meeting the condition is insufficient, then get the first record of this group. OrderDate[-1] represents the previous record relative to the current one. Below are the results of three sampling:

esProc_at03a_16

Data grouping, splitting and export

The sales.txt file stores a large number of sales orders, which we want to divide into multiple files according to both the year and the month. The format of the file name will be “year-month.txt”. Below is the source file:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

39     NR    3       3016.0      2010-08-21

43     KT     3       2169.0      2009-08-27

71     JFE   3       240.4        2010-10-01

99     RA    3       1731.2      2009-11-05

esProc code:

A
1 =file(“D:\\ sales.txt”).import@t()
2 =A1.group(string(OrderDate,”yyyy-MM”);~)
3 =A2.run(file(“d:\\temp\\”+#1+”.txt”).export(#2))

Explanation: Group the imported file by the year and the month, parse the data in each group; loop through each group to write it to the file. For instance, the following is the contents of the 2009-01.txt file:

65     YZ     8       29600.0   2009-01-06

62     JAXE          11     8134.0      2009-01-06

64     HP    13     20000.0   2009-01-02

60     PWQ         16     3430.0      2009-01-05

63     SJCH         16     5880.0      2009-01-02

61     SJCH         19     1078.0      2009-01-08

Extended information: When the size of the source data exceeds the memory capacity, we should import the file with the cursor function. If each group of data is still too big to be accommodated by the memory, we can use the groupx function to group the cursor file. The coding structures for both cases are the same.

Handling a big file by groups

The sales.txt file is too big to be held by the memory, thus we need to filter out the records of the clients whose total order amount is greater than 10,000, and write them to another file. The source data is already ordered by SellerId, as shown below:

OrderID   Client        SellerId     Amount    OrderDate

26     TAS   1       2142.4      2009-08-05

33     DSGC        1       613.2        2009-08-14

84     GC    1       88.5 2009-10-16

133  HU   1       1419.8      2010-12-12

32     JFS    3       468.0        2009-08-13

esProc code:

A B C
1 =file(“D:\\sales.csv”).cursor@t()
2 for A1;SellerId if A2.sum(Amount)>=10000 =file(“e:\\result.txt”).export@at(A2)

Explanation: Loop through the cursor by SellerId and import a group of data into the memory at a time; sum up the amounts for the current group and write the records to the new file if the sum satisfies the condition. Here’s the result:

esProc_at03a_17

If the final result set is relatively small, we can append it to a cell (B1, for instance). In that case C2 should be B1=B1|A2.

If the source data isn’t ordered, we can first sort it using the sortx function, which returns a cursor, too.

Multilevel grouping

The Stock.txt file keeps the information of goods in and out of the warehouse. A same kind of goods may be found in and out several times a day, or there may be nothing in and out during continuous days. The initial inventory of the goods is 0. We use In to represent goods shipped in, and Out for those shipped out. Now we want to calculate the daily inventory for all kinds of goods. Below is the source file:

date name        quantity   flag

2014-04-01       Item1       15     In

2014-04-01       Item1       4       In

2014-04-02       Item1       3       In

2014-04-02       Item1       10     Out

2014-04-03       Item1       3       In

2014-04-04       Item1       5       Out

2014-04-07       Item1       4       In

2014-04-10       Item1       2       Out

2014-04-01       Item2       20     In

2014-04-02       Item3       30     In

2014-04-03       Item3       14     Out

esProc code:

A B
1 =file(“D:\\stock.txt”).cursor@t()
2 =A1.group(name,date;~.select(flag==”In”).sum(quantity):in,~.select(flag==”Out”).sum(quantity):out)
3 =A2.group(name) =periods((t=A2.id(date)).min(),t.max(),1)
4 for A3 =A4.align(B3,date)
5 >c=0
6 =B4.new(A4.name:name,
B3(#):date,
c:Opening,
in,
(b=c+in):Total,
out,
(c=b-out):Close)
7 =@|B5

Explanation: A2 calculates the daily in-and-out quantity for each kind of goods; obtain the complete list of dates according to the earliest and latest dates and store it in B3; group records by items, loop through each group to align the current group of records with B3 and calculate the daily inventory in order. Here’s the result:

esProc_at03a_18

 

Advertisements

About datathinker

a technical consultant on Database performance optimization, Database storage expansion, Off-database computation. personal blog at: datakeywrod, website: raqsoft
This entry was posted in Java Assistant. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s