Who needs a database when you have R?

This is really nice: using the following two statements you can read a semicolon-separated log file into a variable and run an SQL query over the data:

library(sqldf)

stats <- read.table("stats.log", header=FALSE, sep=";",
    col.names=c("id","ip","timestamp","url","time"))
callsPerUrl <- sqldf(
  "select distinct(url), count(ip) from stats group by url")

No need to create a database: just a simple script that you can run from the command line using R.

Recently I needed to calculate some statistics on log files created during a performance test. The log files were too big for MS Excel, so one option was to load them into MS Access or some other database.

However, a co-worker told me about R, and I thought this might be a good opportunity to try it. If you know a bit of SQL it’s pretty easy to get started.

Here’s an example log file I’m going to process:

"1";"192.168.1.1";2011-08-12 11:47:25;"/search";2.997
"2";"192.168.1.2";2011-08-12 11:47:29;"/book/ISBN/1234";4.6
"3";"192.168.1.1";2011-08-12 11:47:31;"/searchresults";1.383
"4";"192.168.1.2";2011-08-12 11:47:46;"/searchresults";10.536

You can read this log file into an R data.table using the following statements:

library(data.table)

# read the semicolon-separated log and wrap it in a data.table
statsTable <- data.table(read.table("stats.log",
    header=FALSE, sep=";",
    col.names=c("id","ip","timestamp","url","time")))
# sort and index the table by url and ip
setkey(statsTable, url, ip)
statsTable

If you run this in R, you get the following result:

     id          ip           timestamp             url   time
[1,]  2 192.168.1.2 2011-08-12 11:47:29 /book/ISBN/1234  4.600
[2,]  1 192.168.1.1 2011-08-12 11:47:25         /search  2.997
[3,]  3 192.168.1.1 2011-08-12 11:47:31  /searchresults  1.383
[4,]  4 192.168.1.2 2011-08-12 11:47:46  /searchresults 10.536

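Setting the key sorts the table by url and ip and also allows fast lookups on those columns. As a small aside (my own example, not part of the original analysis), a keyed lookup returns all calls for one URL without scanning the whole table:

# keyed lookup: binary search on the first key column (url)
statsTable["/searchresults"]
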
The following statement calculates the average time it takes to serve a URL:

avgPerUrl <- statsTable[,list(avg=round(mean(time),2)),by=url]
avgPerUrl

The result:

                 url  avg
[1,] /book/ISBN/1234 4.60
[2,]         /search 3.00
[3,]  /searchresults 5.96

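The same list(...) with by=url syntax extends to more than one summary column. As a rough sketch (the column names calls and maxTime are mine, not from the original post), you could count the calls and find the slowest response per URL in one pass:

# number of calls and slowest response per url, computed in one grouping pass
summaryPerUrl <- statsTable[,
    list(calls=length(time), maxTime=max(time)), by=url]
summaryPerUrl
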
Now the best part, for those of us who are more familiar with SQL: just use SQL.

library(sqldf)
callsPerUrl <- sqldf(
    "select distinct(url), count(ip) from statsTable group by url")
callsPerUrl

The result:

              url count(ip)
1 /book/ISBN/1234         1
2         /search         1
3  /searchresults         2

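Any SQL you already know works the same way. Just as an illustration (this aggregate is my addition, not part of the original example), the average time per URL can also be computed with sqldf:

avgPerUrlSql <- sqldf(
    "select url, avg(time) as avgTime from statsTable group by url")
avgPerUrlSql
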
Nice. One statement to load a file into a data table, and another statement to query the data. No need to start an RDBMS server, create tables, or import data: just a simple script. And it gets better from here: simple statements to plot graphs, and lots of libraries for more complex analysis…
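
For example, a minimal plotting sketch using base R's barplot (my own example, assuming the avgPerUrl table from above) to visualise the average response time per URL:

# simple bar chart of the average response time per url
barplot(avgPerUrl$avg, names.arg=avgPerUrl$url,
    ylab="avg time (s)", main="Average response time per URL")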
