Django provides a friendly Object Relational Mapping (ORM) framework. In several of my data analysis projects, I used the Django ORM to process millions of logcat lines generated by hundreds of Android phones. Here are some of the experiences and tips that help make the processing just a bit faster.


Turn Off DEBUG Mode

First of all, set DEBUG to False in your settings module. With DEBUG set to True, Django keeps in memory a record of every DB query it has run so far, which leads to a memory leak if you have a large batch of importing work.
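In the settings module (settings.py in a default project layout) this is a one-line change:

```python
# settings.py
DEBUG = False  # with DEBUG = True, Django appends every executed query
               # to django.db.connection.queries, which grows without bound
```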

Control Transactions Manually

By default, Django wraps each database operation in a separate transaction and commits them automatically. Hitting the database this frequently will definitely slow you down, especially when all you want to do is insert a large amount of data. Django's transaction module provides several functions that let you control when a transaction is committed. My favorite is to use transaction.commit_on_success to wrap the function that imports data for an individual device. An additional benefit is that you now know the data import for each device either finished completely or didn't happen at all. So if something goes wrong during the import, or you have to stop it in the middle for some reason, the next time you rerun it you won't get duplicate rows!
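As a sketch of the decorator form: the LogEntry model and import_device_logs function below are hypothetical names, not from any real project. Note that commit_on_success was the API in Django 1.5 and earlier; later versions replaced it with transaction.atomic.

```python
from django.db import transaction

from myapp.models import LogEntry  # hypothetical model

@transaction.commit_on_success  # Django <= 1.5; use @transaction.atomic on 1.6+
def import_device_logs(device_id, lines):
    """Import all log lines for one device inside a single transaction.

    If anything raises, the whole transaction is rolled back, so a
    device's data is either fully imported or not imported at all.
    """
    for line in lines:
        LogEntry.objects.create(device_id=device_id, message=line)
```

This sketch requires a configured Django project to run; the point is simply that one commit covers the whole per-device import instead of one commit per row.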

Bulk Create Rows

When you have lots of data to import into the database, instead of calling each object's save() method individually, you can store the objects in a list and use the object manager's bulk_create function. It inserts the list of objects into the database "in an efficient manner". Used together with the transaction.commit_on_success technique mentioned above, the data import should be fast enough.
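A minimal sketch of the pattern, again assuming the hypothetical LogEntry model; the chunked helper and the 5000-row batch size are my own illustrative choices, since very large lists are often split into batches before handing them to bulk_create:

```python
def chunked(items, size):
    """Yield fixed-size slices of a list, for batching bulk_create calls."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def bulk_import(device_id, lines, batch_size=5000):
    # Hypothetical model; adjust to your schema. Imported here so the
    # batching helper above stays usable outside a Django project.
    from myapp.models import LogEntry

    # Build unsaved model instances in memory, then insert them in batches
    # instead of issuing one INSERT per save() call.
    rows = [LogEntry(device_id=device_id, message=line) for line in lines]
    for batch in chunked(rows, batch_size):
        LogEntry.objects.bulk_create(batch)
```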


Iterate Query Sets Without Caching

Now that all the raw data is in the database, the next thing you probably want to do is run a second pass of processing, filtering, or whatever. When the data set is large, it's unlikely that you need to touch the same rows again and again. Most of the time, you just want to iterate through each log line to gather some statistics or do some simple computation. So after you construct your (crazy) query set, append an .iterator() call to it, so Django knows you only want to iterate over the data once and won't bother to cache it. Otherwise, Django caches the query results, and soon you will find your system freezing while the kernel does nothing but swap...
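A sketch of a single statistics pass, assuming the hypothetical LogEntry model with a level field (both names are illustrative):

```python
from collections import Counter

def count_log_levels(device_id):
    """Stream a large result set once, without Django's queryset cache."""
    from myapp.models import LogEntry  # hypothetical model

    counts = Counter()
    qs = LogEntry.objects.filter(device_id=device_id)
    for entry in qs.iterator():  # rows are streamed, not held in the cache
        counts[entry.level] += 1
    return counts
```

Without .iterator(), the same loop would keep every fetched row alive in the queryset's internal result cache for the duration of the pass.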

Reset Queries And Garbage Collection

Every now and then you can also clear Django's stored queries manually with the reset_queries function, and trigger garbage collection using gc.collect(). Both help to further reduce memory usage.
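One way to sketch this is a driver loop that cleans up between devices; handle_one_device stands in for whatever per-device import function you have (a hypothetical name):

```python
import gc

def process_devices(device_ids, handle_one_device):
    """Run a per-device work function, releasing memory between devices."""
    from django.db import reset_queries

    for device_id in device_ids:
        handle_one_device(device_id)
        reset_queries()  # drop the query log Django accumulates under DEBUG
        gc.collect()     # force a collection pass to reclaim freed objects
```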


Further reading: the "Database access optimization" chapter of the Django documentation.