For a while I have used a variety of relational SQL databases with R, such as PostgreSQL, MySQL, and SQLite. A recent project introduced MongoDB, a popular NoSQL database that stores and indexes JSON-style documents. The data is semi-structured with a rapidly changing collection of fields in the documents.
In R, there are two packages, RMongo and rmongodb, for accessing MongoDB databases. I chose rmongodb since it uses a native C driver instead of the Java-based driver behind RMongo. The rest of this article assumes some knowledge of R and MongoDB.
Begin by connecting to the MongoDB database.
library(rmongodb)
mongo <- mongo.create(host = "mongosrv",
                      username = "me",
                      password = "pass",
                      db = "mydb")
Define the subset of fields to use for querying and retrieval. An R list is a convenient data structure for the representation, and several rmongodb commands can use lists directly or create JSON and BSON from them.
myfields <- list(name = 1L,
                 address = 1L,
                 closing_date = 1L)
It is very important with MongoDB to create and use an index that includes the fields to retrieve. During data load, create an index:
mongo.index.create(mongo, "mydb.homepurchases", myfields)
Get the number of records to be retrieved.
n <- mongo.count(mongo, "mydb.homepurchases")
For the query, it is tempting to use the mongo.find() function with the standard collection of parameters. Make sure to exclude the _id field from the result unless it is part of your index.
fields <- list(name = 1L, closing_date = 1L, `_id` = 0L)  # backticks: _id is not a syntactic R name
cursor <- mongo.find(mongo,
                     "mydb.homepurchases",
                     query = list(),
                     sort = list(closing_date = 1L),
                     fields = fields)
Even with a covering query, this can be slow because the index might be ignored. Instead, set a query hint naming the index to use. This seems to require building a more complicated description of the query, with the sort order included in the query itself instead of passed as a parameter. Also, in the final successful version, the fields are encoded in BSON before being passed to mongo.find().
query <- list("$query" = list(),
              "$orderby" = list(closing_date = 1L),
              "$hint" = myfields)
query <- mongo.bson.from.list(query)

fields <- list(name = 1L, closing_date = 1L, `_id` = 0L)  # backticks: _id is not a syntactic R name
fields <- mongo.bson.from.list(fields)
cursor <- mongo.find(mongo,
                     "mydb.homepurchases",
                     query = query,
                     fields = fields)
When retrieving the results, particularly a large number of them, don't use mongo.find.all(), which is very slow due to its implementation. Although ugly, preallocate space for the results and use rmongodb's low-level functions. Of course, you can wrap this in a function of your own within your application.
names <- character(n)
dates <- character(n)  # use a date class if desired

i <- 0
while (mongo.cursor.next(cursor)) {
  i <- i + 1
  val <- mongo.cursor.value(cursor)
  tmp <- mongo.bson.value(val, "name")
  names[i] <- ifelse(length(tmp) > 0, tmp, NA)
  tmp <- mongo.bson.value(val, "closing_date")
  dates[i] <- ifelse(length(tmp) > 0, tmp, NA)
}
mongo.cursor.destroy(cursor)
mongo.destroy(mongo)

df <- data.frame(names = names, dates = dates)
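The retrieval loop above generalizes naturally. A minimal sketch of such a wrapper, assuming an open rmongodb connection; the function name, its two-field restriction, and its argument layout are my own, not part of rmongodb:

```r
# Hypothetical helper: fetch two string-valued fields from a collection
# using the preallocate-and-cursor pattern shown above.
# mongo  - an open connection from mongo.create()
# ns     - namespace, e.g. "mydb.homepurchases"
# query/fields - BSON objects built with mongo.bson.from.list()
# f1, f2 - names of the two fields to extract
# n      - expected record count from mongo.count()
fetch_two_fields <- function(mongo, ns, query, fields, f1, f2, n) {
  col1 <- character(n)  # preallocate to avoid slow incremental growth
  col2 <- character(n)
  cursor <- mongo.find(mongo, ns, query = query, fields = fields)
  i <- 0
  while (mongo.cursor.next(cursor)) {
    i <- i + 1
    tmp <- mongo.bson.value(mongo.cursor.value(cursor), f1)
    col1[i] <- ifelse(length(tmp) > 0, tmp, NA)
    tmp <- mongo.bson.value(mongo.cursor.value(cursor), f2)
    col2[i] <- ifelse(length(tmp) > 0, tmp, NA)
  }
  mongo.cursor.destroy(cursor)
  data.frame(col1 = col1, col2 = col2, stringsAsFactors = FALSE)
}
```

With this in place, the body of the example reduces to a single call, e.g. fetch_two_fields(mongo, "mydb.homepurchases", query, fields, "name", "closing_date", n).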
Using the hint in the query sped up the query for over 30,000 records from over 30 seconds to about 0.5 seconds. Preallocating space and using the low-level routines to gather results cut the time for storing the data by about a factor of 2.
Once this was accomplished, it was possible to include queries to MongoDB in an interactive Shiny app without serious lag.