This is really nice: using the following two statements you can read a semicolon-separated file into a variable and run a SQL query over the data:
stats <- read.table("stats.log", header=FALSE, sep=";",
                    col.names=c("id","ip","timestamp","url","time"))
callsPerUrl <- sqldf(
"select distinct(url), count(ip) from stats group by url")
No need to create a database, just a simple script that you can start on the command line using R.
Recently I needed to calculate some statistics on log files created during a performance test. The log files are too big for MS Excel; one option was to load them into MS Access or some other database.
However, a co-worker told me about R. I thought this might be a good opportunity to try it. If you know a bit about SQL, it’s pretty easy to get started.
Here’s an example log file I’m going to process:
"1";"192.168.1.1";2011-08-12 11:47:25;"/search";2.997
"2";"192.168.1.2";2011-08-12 11:47:29;"/book/ISBN/1234";4.6
"3";"192.168.1.1";2011-08-12 11:47:31;"/searchresults";1.383
"4";"192.168.1.2";2011-08-12 11:47:46;"/searchresults";10.536
You can read this log file into an R data.table using the following statements:
library(data.table)
statsTable <- data.table(read.table("stats.log",
                         header=FALSE, sep=";",
                         col.names=c("id","ip","timestamp","url","time")))
setkey(statsTable,url,ip)
statsTable
If you run this using R you get the following result:
id ip timestamp url time
[1,] 2 192.168.1.2 2011-08-12 11:47:29 /book/ISBN/1234 4.600
[2,] 1 192.168.1.1 2011-08-12 11:47:25 /search 2.997
[3,] 3 192.168.1.1 2011-08-12 11:47:31 /searchresults 1.383
[4,] 4 192.168.1.2 2011-08-12 11:47:46 /searchresults 10.536
The following statement calculates the average time it takes to serve a URL:
avgPerUrl <- statsTable[,list(avg=round(mean(time),2)),by=url]
avgPerUrl
The result:
url avg
[1,] /book/ISBN/1234 4.60
[2,] /search 3.00
[3,] /searchresults 5.96
Now, the best part for those of us who are more familiar with SQL: just use SQL.
library(sqldf)
callsPerUrl <- sqldf(
"select distinct(url), count(ip) from statsTable group by url")
callsPerUrl
The result:
url count(ip)
1 /book/ISBN/1234 1
2 /search 1
3 /searchresults 2
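As a side note, the same per-URL count can also be expressed in data.table itself, using the special symbol .N (the number of rows in each group). A minimal sketch, with the example rows from above inlined so it runs on its own:

```r
library(data.table)

# Inline the example log rows (normally read from stats.log)
statsTable <- data.table(
  ip  = c("192.168.1.2", "192.168.1.1", "192.168.1.1", "192.168.1.2"),
  url = c("/book/ISBN/1234", "/search", "/searchresults", "/searchresults"))

# .N is the row count per group; equivalent to count(ip) ... group by url
callsPerUrl <- statsTable[, list(calls = .N), by = url]
callsPerUrl
```

Whether you prefer this or the sqldf version is mostly a matter of taste; the data.table form avoids loading a second package.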
Nice. One statement to load a file into a data table, and another to query the data. No need to start an RDBMS server, create tables, or import data: just a simple script. And it gets better from here: simple statements to plot graphs, and lots of libraries to do complex analysis…
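For instance, a bar chart of the average response times takes a single call to base R’s barplot. A sketch using the averages computed above (inlined here so the snippet runs standalone; the file name avgPerUrl.png is just an example):

```r
# Average response time per URL, taken from the avgPerUrl result above
urls <- c("/book/ISBN/1234", "/search", "/searchresults")
avg  <- c(4.60, 3.00, 5.96)

# Write a bar chart to a PNG file; las=2 turns the URL labels vertical
png("avgPerUrl.png")
barplot(avg, names.arg = urls, ylab = "avg response time (s)", las = 2)
dev.off()
```

Run it from the command line with Rscript and you get a chart without opening an interactive session.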