For a while I have used a variety of relational SQL databases with R, such as PostgreSQL, MySQL, and SQLite. A recent project introduced MongoDB, a popular NoSQL database that stores and indexes JSON-style documents. The data is semi-structured with a rapidly changing collection of fields in the documents.
In R, there are two packages, RMongo and rmongodb, for accessing MongoDB databases. I chose rmongodb since it uses a native C driver instead of the Java-based driver behind RMongo. The rest of this article assumes some knowledge of R and MongoDB.
Begin by connecting to the MongoDB database.
library(rmongodb)
mongo <- mongo.create(host = "mongosrv",
                      username = "me",
                      password = "pass",
                      db = "mydb")
Define the subset of fields to use for querying and retrieval. An R list is a convenient data structure for the representation, and several rmongodb commands can use lists directly or create JSON and BSON from them.
myfields <- list(name = 1L,
                 address = 1L,
                 closing_date = 1L)
It is very important with MongoDB to create and use an index that includes the fields to retrieve. During data load, create an index:
mongo.index.create(mongo, "mydb.homepurchases", myfields)
Get the number of records to be retrieved.
n <- mongo.count(mongo, "mydb.homepurchases")
For the query, it is tempting to use the mongo.find() function with the standard collection of parameters. Make sure to exclude the _id field from the result unless it is part of your index.
fields <- list(name = 1L, closing_date = 1L, `_id` = 0L)  # backticks: _id is not a syntactic R name
cursor <- mongo.find(mongo,
                     "mydb.homepurchases",
                     query = list(),
                     sort = list(closing_date = 1L),
                     fields = fields)
Even with a covering query, this can be slow because the index might be ignored. Instead, set a query hint naming the index to use. This seems to require building a more complicated description of the query, with the sort order included in the query itself instead of passed as a parameter. Also, in the final successful version, the fields are encoded in BSON before being passed to mongo.find().
query <- list("$query" = list(),
              "$orderby" = list(closing_date = 1L),
              "$hint" = myfields)
query <- mongo.bson.from.list(query)

fields <- list(name = 1L, closing_date = 1L, `_id` = 0L)  # backticks: _id is not a syntactic R name
fields <- mongo.bson.from.list(fields)
cursor <- mongo.find(mongo,
                     "mydb.homepurchases",
                     query = query,
                     fields = fields)
When retrieving the results, particularly a large number of them, don't use mongo.find.all(), which is very slow due to its implementation. Although ugly, preallocate space for the results and use rmongodb's low-level functions. Of course, you can wrap this in a function of your own within your application.
names <- character(n)
dates <- character(n)  # use a date class if desired

i <- 0
while (mongo.cursor.next(cursor)) {
  i <- i + 1
  val <- mongo.cursor.value(cursor)
  tmp <- mongo.bson.value(val, "name")
  names[i] <- ifelse(length(tmp) > 0, tmp, NA)
  tmp <- mongo.bson.value(val, "closing_date")
  dates[i] <- ifelse(length(tmp) > 0, tmp, NA)
}
mongo.cursor.destroy(cursor)
mongo.destroy(mongo)

df <- data.frame(names = names, dates = dates)
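The retrieval loop above generalizes naturally. A minimal sketch of such a wrapper, assuming an open rmongodb connection; the function name, its two-field restriction, and its argument layout are my own, not part of rmongodb:

```r
# Hypothetical helper: fetch two string-valued fields from a collection
# using the preallocate-and-cursor pattern shown above.
# mongo  - an open connection from mongo.create()
# ns     - namespace, e.g. "mydb.homepurchases"
# query/fields - BSON objects built with mongo.bson.from.list()
# f1, f2 - names of the two fields to extract
# n      - expected record count from mongo.count()
fetch_two_fields <- function(mongo, ns, query, fields, f1, f2, n) {
  col1 <- character(n)  # preallocate to avoid slow incremental growth
  col2 <- character(n)
  cursor <- mongo.find(mongo, ns, query = query, fields = fields)
  i <- 0
  while (mongo.cursor.next(cursor)) {
    i <- i + 1
    tmp <- mongo.bson.value(mongo.cursor.value(cursor), f1)
    col1[i] <- ifelse(length(tmp) > 0, tmp, NA)
    tmp <- mongo.bson.value(mongo.cursor.value(cursor), f2)
    col2[i] <- ifelse(length(tmp) > 0, tmp, NA)
  }
  mongo.cursor.destroy(cursor)
  data.frame(col1 = col1, col2 = col2, stringsAsFactors = FALSE)
}
```

With this in place, the body of the example reduces to a single call, e.g. fetch_two_fields(mongo, "mydb.homepurchases", query, fields, "name", "closing_date", n).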
Using the hint in the query sped up the query for over 30,000 records from over 30 seconds to about 0.5 seconds. Preallocating space and using the low-level routines to gather results cut the time for storing the data by about a factor of 2.
Once this was accomplished, it was possible to include queries to MongoDB in an interactive Shiny app without serious lag.