Rick Copeland is the principal consultant at Arborian Consulting, LLC, where he helps clients build custom web applications using Python and MongoDB. He previously worked as a lead software engineer at SourceForge, where he helped lead the transformation from a PHP/Postgres/MySQL codebase to a Python/MongoDB codebase. Rick is the primary author of Ming, a Python object mapper for MongoDB, and Zarkov, a realtime analytics platform based on MongoDB. Prior to GeekNet, Rick worked in fields from retail analytics to hardware chip design. Rick's personal blog is hosted at Just a Little Python.

A Taste of GridFS, MongoDB's Filesystem

05.26.2012

In some previous posts on MongoDB, Python, and pymongo, I introduced the NoSQL database MongoDB and showed how you can use it from Python. This post goes beyond the basics of MongoDB and pymongo to give you a taste of MongoDB's take on filesystems, GridFS.

Why a filesystem?

If you've been using MongoDB for a while, you may have heard about the 16 MB document size limit. When I started using MongoDB (around version 0.8), the limit was actually 4 MB. What this means is that everything works just fine, your system is screaming fast, until you try to create a document that's 4.001 MB, and boom, nothing works any more. For us at SourceForge, that meant we had to restructure our schema and use less embedding.

But what if it's not something that can be restructured? Maybe your site allows users to upload large attachments of unknown size. In such cases you can probably get away with using a Binary field type and crossing your fingers, but a better solution, in my opinion, is to actually store the contents of your upload in a series of documents (let's call them "chunks") of limited size. Then you can tie them all together with another document that specifies all the file metadata.
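
To make that idea concrete, here's a minimal sketch of the do-it-yourself approach; the my_files and my_chunks collections and the save_file helper are my own inventions for illustration, not anything MongoDB provides:

import pymongo
from bson import Binary

conn = pymongo.Connection()
db = conn.chunking_test

CHUNK_SIZE = 256 * 1024  # keep each chunk comfortably under the document size limit

def save_file(data, filename):
    # One metadata document ties all the chunks together
    file_id = db.my_files.insert(
        dict(filename=filename, length=len(data), chunkSize=CHUNK_SIZE))
    # Split the payload into numbered chunk documents of bounded size
    for n, pos in enumerate(range(0, len(data), CHUNK_SIZE)):
        db.my_chunks.insert(dict(
            files_id=file_id, n=n,
            data=Binary(data[pos:pos + CHUNK_SIZE])))
    return file_id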

GridFS to the rescue

Well, that's exactly what GridFS does, but with a nice API and a few more bells and whistles than you'd probably build on your own. It's important to note that GridFS, implemented in all the MongoDB language drivers, is a convention and an API, not something provided natively by the server. As far as the server is concerned, it's all just collections and documents.

The GridFS schema

GridFS actually stores your files in two collections, named by default fs.files and fs.chunks, although you can change the fs prefix to something else if you'd like. The fs.files collection is where reading or writing a file begins. A typical fs.files document looks like the following:

{
  // unique ID for this file
  "_id" : <unspecified>,
  // size of the file in bytes
  "length" : data_number,
  // size of each of the chunks.  Default is 256k
  "chunkSize" : data_number,
  // date when object first stored
  "uploadDate" : data_date,
  // result of running the "filemd5" command on this file's chunks
  "md5" : data_string
}
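
Incidentally, that md5 value comes from a real server command, so you can re-verify a file's integrity yourself via pymongo's Database.command; a sketch, assuming db is a pymongo database handle and file_id holds the _id of an fs.files document:

>>> db.command('filemd5', file_id, root='fs')
{u'md5': u'...', u'ok': 1.0}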

The fs.chunks collection contains all the data for your files:

{
  // object id of the chunk in the fs.chunks collection
  "_id" : <unspecified>,
  // _id of the corresponding fs.files entry
  "files_id" : <unspecified>,
  // chunks are numbered in order, starting with 0
  "n" : chunk_number,
  // the chunk's payload as a BSON binary type
  "data" : data_binary
}
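
Reading a file back out of this schema is just a matter of fetching its chunks in order and concatenating their payloads. The drivers handle this for you, but a hand-rolled version might look like the following sketch (read_file is a hypothetical helper, not part of any driver):

def read_file(db, file_id):
    # Fetch the chunks sorted by their sequence number and stitch the
    # payloads back together; bson.Binary is a str subclass in Python 2,
    # so the join works directly on the raw chunk data
    chunks = db.fs.chunks.find(dict(files_id=file_id)).sort('n')
    return ''.join(chunk['data'] for chunk in chunks)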

In the Python gridfs package (included with the pymongo driver), several other fields may be set as well:

filename
    The 'human' name for the file, which may be path-delimited to simulate directories.
contentType
    The MIME type of the file.
encoding
    The Unicode encoding used for text files.

You can also add your own attributes to files. At SourceForge, we used things like project_id or forum_id to allow the same filename to be uploaded to multiple places on the site without worrying about namespace collisions. To keep your code future-proof, you should put any custom attributes inside an embedded metadata document, just in case the GridFS spec expands to incorporate more fields.

Using GridFS

So with all that out of the way, how do you actually use GridFS? It's actually pretty straightforward. The first thing you need is a reference to a GridFS filesystem:

>>> import pymongo
>>> import gridfs
>>> conn = pymongo.Connection()
>>> db = conn.gridfs_test
>>> fs = gridfs.GridFS(db)
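
(A quick note for readers on newer driver versions: pymongo.Connection was later replaced by pymongo.MongoClient, so on a recent PyMongo the setup looks like this instead:)

>>> conn = pymongo.MongoClient()
>>> db = conn.gridfs_test
>>> fs = gridfs.GridFS(db)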

Basic reading and writing

Once you have the filesystem, you can start putting stuff in it:

>>> with fs.new_file() as fp:
...     fp.write('This is my new file. It is teh awezum!')

Let's examine the underlying collections to see what actually happened:

>>> list(db.fs.files.find())
[{u'length': 38,
  u'_id': ObjectId('4fbfa7b9fb72f096bd000000'),
  u'uploadDate': datetime.datetime(2012, 5, 25, 15, 39, 37, 55000),
  u'md5': u'332de5ca08b73218a8777da69293576a',
  u'chunkSize': 262144}]
>>> list(db.fs.chunks.find())
[{u'files_id': ObjectId('4fbfa7b9fb72f096bd000000'),
  u'_id': ObjectId('4fbfa7b9fb72f096bd000001'),
  u'data': Binary('This is my new file. It is teh awezum!', 0),
  u'n': 0}]

You can see that there's really nothing surprising or mysterious happening here; it's just mapping the filesystem metaphor onto MongoDB documents. In this case, our file was small enough that it didn't need to be split into chunks. We can force a split by specifying a small chunkSize when creating the file:

>>> with fs.new_file(chunkSize=10) as fp:
...     fp.write('This is file number 2. It should be split into several chunks')
...
>>> fp
<gridfs.grid_file.GridIn object at 0x1010f5950>
>>> fp._id
ObjectId('4fbfa8ddfb72f0971c000000')
>>> list(db.fs.chunks.find(dict(files_id=fp._id)))
[{... u'data': Binary('This is fi', 0), u'n': 0},
 {... u'data': Binary('le number ', 0), u'n': 1},
 {... u'data': Binary('2. It shou', 0), u'n': 2},
 {... u'data': Binary('ld be spli', 0), u'n': 3},
 {... u'data': Binary('t into sev', 0), u'n': 4},
 {... u'data': Binary('eral chunk', 0), u'n': 5},
 {... u'data': Binary('s', 0), u'n': 6}]

Now, if we actually want to read the file as a file, we'll need to use the gridfs API:

>>> with fs.get(fp._id) as fp_read:
...     print fp_read.read()
...
This is file number 2. It should be split into several chunks

Treating it more like a filesystem

There are several other convenience methods bundled into the GridFS object to give more filesystem-like behavior. For instance, new_file() takes any number of keyword arguments that will get added onto the fs.files document being created:

>>> with fs.new_file(
...     filename='file.txt', 
...     content_type='text/plain', 
...     my_other_attribute=42) as fp:
...     fp.write('New file')
...
>>> fp
<gridfs.grid_file.GridIn object at 0x1010f59d0>
>>> db.fs.files.find_one(dict(_id=fp._id))
{u'contentType': u'text/plain',
 u'chunkSize': 262144,
 u'my_other_attribute': 42,
 u'filename': u'file.txt',
 u'length': 8,
 u'uploadDate': datetime.datetime(2012, 5, 25, 15, 53, 1, 973000),
 u'_id': ObjectId('4fbfaaddfb72f0971c000008'), u'md5':
 u'681e10aecbafd7dd385fa51798ca0fd6'}

Better would be to encapsulate my_other_attribute inside the metadata property:

>>> with fs.new_file(
...     filename='file2.txt', 
...     content_type='text/plain', 
...     metadata=dict(my_other_attribute=42)) as fp:
...     fp.write('New file 2')
...
>>> db.fs.files.find_one(dict(_id=fp._id))
{u'contentType': u'text/plain',
 u'chunkSize': 262144,
 u'metadata': {u'my_other_attribute': 42},
 u'filename': u'file2.txt',
 u'length': 10,
 u'uploadDate': datetime.datetime(2012, 5, 25, 15, 54, 5, 67000),
 u'_id':ObjectId('4fbfab1dfb72f0971c00000a'),
 u'md5': u'9e4eea3dec28d8346b52f18086437ac7'}

We can also "overwrite" files by filename, but since GridFS actually indexes files by _id, it doesn't get rid of the old file; it just versions it:

>>> with fs.new_file(filename='file.txt', content_type='text/plain') as fp:
...     fp.write('Overwrite the so-called "New file"')
...

Now, if we want to retrieve the file by filename, we can use get_version or get_last_version:

>>> fs.get_last_version('file.txt').read()
'Overwrite the so-called "New file"'
>>> fs.get_version('file.txt', 0).read()
'New file'

Since we've been uploading files with a filename property, we can also list the files in GridFS:

>>> fs.list()
[u'file.txt', u'file2.txt']

We can also remove files, of course:

>>> fp = fs.get_last_version('file.txt')
>>> fs.delete(fp._id)
>>> fs.list()
[u'file.txt', u'file2.txt']
>>> fs.get_last_version('file.txt').read()
'New file'

Note that since only one version of "file.txt" was removed, we still have a file named "file.txt" in the filesystem.
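
If you ever do want to remove every version of a filename, a small helper like this sketch would do it (delete_all_versions is my own invention, not part of the gridfs API):

def delete_all_versions(fs, db, filename):
    # Remove each matching fs.files entry by _id; fs.delete also
    # cleans up the corresponding fs.chunks documents
    for doc in db.fs.files.find(dict(filename=filename)):
        fs.delete(doc['_id'])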

Finally, gridfs also provides convenience methods for determining whether a file exists and for quickly writing a short file into GridFS:

>>> fs.exists(fp._id)
False
>>> fs.exists(filename='file.txt')
True
>>> fs.exists({'filename': 'file.txt'}) # equivalent to above
True
>>> fs.put('The quick brown fox', filename='typingtest.txt')
ObjectId('4fbfad74fb72f0971c00000e')
>>> fs.get_last_version('typingtest.txt').read()
'The quick brown fox'
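
fs.put also accepts file-like objects, so you can stream a file straight from disk; a sketch, assuming something exists at the made-up path /tmp/typingtest.txt:

>>> with open('/tmp/typingtest.txt', 'rb') as f:
...     file_id = fs.put(f, filename='typingtest.txt',
...                      content_type='text/plain')
...

The returned ObjectId can then be handed to fs.get() to read the contents back.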

So that's the whirlwind tour of GridFS. I'd love to hear how you're using GridFS in your project, or if you think it might be a good fit, so please drop me a line in the comments.

Published at DZone with permission of its author, Rick Copeland.