Preventing duplicate files in Django uploads
Many Django applications make use of uploaded files that should never be uploaded more than once. One example is an image (a single image embedded in a blog post might not be a big deal for a small site, but news sites often deal with licensed high-definition images used across several stories); another is a large PDF with associated metadata (once the metadata is keyed in, users shouldn't be asked to enter it again). Let's talk about two design patterns for handling files, and ways we can deal with duplicates in each one.
With the Attachment design pattern, a file is part of the model as a Django FileField. This is useful if the file to be attached is known ahead of time and will be of a semi-predictable type (for example, a press release for a scheduled report that will be in PDF format). Two or three additional users may be writing blog posts on the report, and each of them plans to attach the report as well. We don't want to store a stack of extra copies of the file on the server - so how can we de-dupe them?
Our post is set up like so:
```python
from django.db import models

class ReportPost(models.Model):
    file = models.FileField()
    file_sha1 = models.CharField(max_length=40)
    post_text = models.TextField()
    # ... etc ...
```
`ReportPost` has space for an uploaded file. We also have a `file_sha1` field here. Django can't automatically detect a duplicate file (`unique` is not supported for `FileField` or any of its derivatives), so we need to take the SHA-1 hash of the file's contents in order to check whether it's been uploaded already. We'll take care of that in the `ModelAdmin` for this model.
```python
from django.contrib import admin
import hashlib

from .models import ReportPost

def generate_sha(file):
    """Hash the uploaded file in 100 MB chunks."""
    sha = hashlib.sha1()
    file.seek(0)
    while True:
        buf = file.read(104857600)
        if not buf:
            break
        sha.update(buf)
    sha1 = sha.hexdigest()
    file.seek(0)
    return sha1

class ReportPostAdmin(admin.ModelAdmin):
    # ... stuff ...

    def save_model(self, request, obj, form, change):
        sha = generate_sha(obj.file)
        obj.file_sha1 = sha
        match = ReportPost.objects.filter(file_sha1=sha).first()
        if match is not None:
            # Point at the copy already on disk instead of storing a new one.
            obj.file = match.file
        obj.save()
```
We derive a SHA-1 hash for the file by reading it in 100-megabyte chunks and feeding each chunk to the hasher, so even very large files are fully hashed without being loaded into memory all at once. In our `ModelAdmin.save_model` method, we store that SHA-1 with our model instance. We also query the database for any instance with that exact same SHA-1 (indicating a match), and point our instance at the previously uploaded file instead of storing a new copy.
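As a sanity check, here's the same chunked-hashing approach as a standalone sketch, with `io.BytesIO` standing in for Django's uploaded file object (the small `chunk_size` argument is just there to exercise the loop):

```python
import hashlib
import io

def generate_sha(file, chunk_size=104857600):
    """Hash a file-like object in chunks (100 MB by default)."""
    sha = hashlib.sha1()
    file.seek(0)
    while True:
        buf = file.read(chunk_size)
        if not buf:
            break
        sha.update(buf)
    file.seek(0)
    return sha.hexdigest()

# Two "uploads" with identical bytes hash identically, so the
# second one can be detected as a duplicate before it's stored.
first = io.BytesIO(b"fake PDF bytes" * 1000)
second = io.BytesIO(b"fake PDF bytes" * 1000)
assert generate_sha(first, chunk_size=4096) == generate_sha(second)
```

Because the hash covers the file's contents rather than its name, a renamed copy of the same report is still caught.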
The File as Object
With this design pattern, files are part of a larger object that gives them relevance. Images in a newsroom app, for example, come with credits, licensing information, and various data about how often the image is used. To store all of that metadata about the image, we need a file object.
```python
# models.py
from django.db import models

class ImageObject(models.Model):
    caption = models.CharField(max_length=140)
    credits = models.CharField(max_length=140)
    image_file = models.ImageField()
    file_sha1 = models.CharField(max_length=40, unique=True)

# admin.py
from django.contrib import admin

from utils import generate_sha  # same func as before

class ImageObjectAdmin(admin.ModelAdmin):
    # ... stuff ...

    def save_model(self, request, obj, form, change):
        sha = generate_sha(obj.image_file)
        obj.file_sha1 = sha
        obj.save()
```
With this example, we add a uniqueness constraint to our model, which lets us easily enforce that each image file has only one set of metadata. Our app will raise an `IntegrityError` if a user tries to save a duplicate SHA-1 hash. We'll need to handle that `IntegrityError` and redirect our user to the previously uploaded image. We can do that using a new middleware object.
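To see what the database itself does under that `unique=True` constraint, here's a minimal, framework-free illustration using plain `sqlite3` rather than Django's ORM (the `'abc123'` hash value is just a placeholder): the second insert of the same hash is rejected at the database level.

```python
import sqlite3

# Stand-in for the table Django would create for ImageObject.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE imageobject (file_sha1 TEXT UNIQUE)")
conn.execute("INSERT INTO imageobject VALUES ('abc123')")

duplicate_rejected = False
try:
    # Same hash a second time violates the unique constraint.
    conn.execute("INSERT INTO imageobject VALUES ('abc123')")
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Django surfaces the same failure as `django.db.IntegrityError`, which is exactly the exception our middleware will catch.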
Middleware objects in Django quietly do their thing 99.9% of the time, and you'll never think twice about them. They're there for session authentication or CSRF enforcement - things that most of us take for granted. But you can add your own middleware by dropping a `middleware.py` file into your application and registering it in settings:
```python
# settings.py
MIDDLEWARE_CLASSES = (
    'django.contrib.sessions.middleware.SessionMiddleware',
    # ... stuff ...
    'our_image_app.middleware.RedirectOnSHAViolation',
)
```
And then in `middleware.py`:

```python
# middleware.py
from django.db import IntegrityError
from django.http import HttpResponseRedirect

from our_image_app.models import ImageObject
from utils import generate_sha  # same func as before

class RedirectOnSHAViolation(object):
    def process_exception(self, request, exception):
        if isinstance(exception, IntegrityError):
            if 'our_image_app/imageobject' in request.path:
                sha1 = generate_sha(request.FILES['image_file'])
                img = ImageObject.objects.get(file_sha1=sha1)
                return HttpResponseRedirect(
                    '/admin/our_image_app/imageobject/' + str(img.id))
```
I never said it was pretty.
Django's middleware system expects to see a class with a `process_exception` method, and that hook is really, really stupid. The middleware knows it has a request that caused an error, but that's about all it knows. We have to do two checks to make sure we're handling the right error here. First, we check that the exception is an `IntegrityError`. We run that check first because every unhandled error in our app passes through this method, so we need to filter for the one type we care about. Once that's confirmed, we check whether the `IntegrityError` came from the `ImageObject` model within `our_image_app` - and since `ImageObject` only has one unique field, we know an `IntegrityError` there must mean a duplicate file. Once we've ascertained all of that, we look up the file we should have used and redirect the user there instead.
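Those two checks can be sketched in isolation, with `FakeRequest` and a stand-in `IntegrityError` class as hypothetical substitutes for Django's request object and `django.db.IntegrityError`:

```python
# Stand-ins for illustration only; real code gets Django's request
# object and django.db.IntegrityError.
class IntegrityError(Exception):
    pass

class FakeRequest:
    def __init__(self, path):
        self.path = path

def should_redirect(request, exception):
    # Check 1: every unhandled exception reaches process_exception,
    # so filter for the one type we care about.
    if not isinstance(exception, IntegrityError):
        return False
    # Check 2: make sure the error came from the imageobject admin.
    return 'our_image_app/imageobject' in request.path

assert should_redirect(
    FakeRequest('/admin/our_image_app/imageobject/add/'), IntegrityError())
assert not should_redirect(
    FakeRequest('/admin/other_app/thing/add/'), IntegrityError())
assert not should_redirect(
    FakeRequest('/admin/our_image_app/imageobject/add/'), ValueError())
```

Factoring the checks out like this also makes the path-matching logic easy to unit-test away from the middleware itself.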
SHA-1 hashes provide an easy way to compensate for the Django FileField's inability to enforce uniqueness by default. Handling the resulting errors can be a bit messy, but we can extend the admin's built-in methods or write a little custom middleware to smooth things over.
Props to @spiggy, who had the original idea of using SHAs to look for duplicate files.