Monday, December 15, 2008

C:

char buf[1024];
strcpy(buf, user_data);

Python:

buf = user_data[:1024]
if len(user_data) > 1024: security_hole(user_data[1024:])

Actually, the translation is not difficult, as long as you implement security_hole() properly.

(c) Twisted Quotes

Saturday, December 6, 2008

Python syslog as it should be

First, the code:

from logging.handlers import SysLogHandler, SYSLOG_UDP_PORT
from logging import Handler
import socket

class UdpSysLogHandler(SysLogHandler):
    # No locking: a UDP sendto() is a single system call,
    # so there is nothing to protect with a lock.
    def createLock(self): pass
    def acquire(self): pass
    def release(self): pass

    def __init__(self, address=('127.0.0.1', SYSLOG_UDP_PORT),
                 facility=SysLogHandler.LOG_USER):
        Handler.__init__(self)
        assert type(address) == tuple
        self.address = address
        self.facility = facility
        self.socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.formatter = None

    def emit(self, record):
        msg = self.format(record)
        # Prepend the syslog priority: <facility * 8 + severity>
        msg = self.log_format_string % (
            self.encodePriority(self.facility,
                                self.mapPriority(record.levelname)),
            msg)
        try:
            self.socket.sendto(msg, self.address)
        except (KeyboardInterrupt, SystemExit):
            raise
        except:
            self.handleError(record)

    def close(self):
        Handler.close(self)
        self.socket.close()
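
Attaching it to a logger is straightforward. A minimal usage sketch (the logger name and format string are arbitrary):

import logging

handler = UdpSysLogHandler()  # defaults to 127.0.0.1:514, LOG_USER
handler.setFormatter(logging.Formatter('myapp: %(levelname)s %(message)s'))

log = logging.getLogger('myapp')
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info('hello, syslog')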

How is it different from the standard SysLogHandler in the Python logging package?

First, the lock is eliminated. In the original version this lock is taken for every logging operation, and the I/O is performed while holding it, which is very bad.

def createLock(self): pass
def acquire(self): pass
def release(self): pass

Second, there is no support for logging through a UNIX domain socket (/dev/log), which slightly simplifies the emit method.

One might say that UNIX domain sockets should be faster than actual networked UDP sockets, because they don't involve the whole networking stack. In practice, however, I noticed that UNIX domain sockets perform much worse. I don't know why. If someone has a clue, please let me know.
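
Here is a rough benchmark sketch for anyone who wants to reproduce the comparison. It assumes a local syslog daemon listening on UDP port 514 and a datagram socket at /dev/log; absolute numbers will of course vary:

import socket
import time

N = 10000
MSG = '<14>bench: hello\000'  # facility USER (1), severity INFO (6)

def bench(sock, dest):
    start = time.time()
    for _ in xrange(N):
        sock.sendto(MSG, dest)
    return time.time() - start

udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
print 'UDP socket:  %.3f s' % bench(udp, ('127.0.0.1', 514))

unx = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
print 'UNIX socket: %.3f s' % bench(unx, '/dev/log')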

Thursday, December 4, 2008

How to save $60 a month with 20 lines of code

We use the Amazon AWS S3 service to store users' pictures. There is a huge number of small pictures.

As the traffic at our web service increased, the cost of serving the pictures from S3 increased as well. Last month we had about 250 GB of picture traffic, which cost around $60.

While we could move all the pictures to a separate static server, this would imply some downtime, complicate further migrations of our servers, and also reduce reliability.

I changed a few lines in the Nginx configuration to make it work as a caching proxy to Amazon S3:

location @s3 {
    internal;
    # Fetch the missing file from S3 and store a local copy
    proxy_pass http://your-bucket-name.s3.amazonaws.com;
    proxy_store on;
    proxy_store_access user:rw group:rw all:r;
    proxy_temp_path /var/static_data/tmp;
    root /var/static_data;
}

location ~* \.(jpg|jpeg|gif|png|ico|css|bmp|js|swf|mp3)$ {
    access_log off;
    # Only cache misses end up in this log
    error_log /var/log/nginx/static_cache_miss.log;
    expires max;
    root /var/static_data;
    # File not on local disk yet -- fall through to S3
    error_page 404 = @s3;
}

The first request for a picture misses on the local disk, falls through to the @s3 location via the error_page directive, and proxy_store saves the fetched file under /var/static_data, so every subsequent request is served locally. The traffic to S3 has dropped from ~10 GB a day to less than 400 MB.

Thursday, November 27, 2008

Python logging in threaded application

As I said in the previous post, I use the Python logging module in my web server to log via syslog. I use CherryPy as the application server: a single Python process serves as the backend, and the CherryPy instance is configured to run about 20 threads to serve requests.

Using the Python profiler, I found out that the threads spend more than half of their time blocked on a lock inside the logging module:

logging/__init__.py

class Handler(Filterer):
    .................
    def acquire(self):
        """
        Acquire the I/O thread lock.
        """
        if self.lock:
            self.lock.acquire()
    ..................
    def handle(self, record):
        """
        Conditionally emit the specified logging record.

        Emission depends on filters which may have been added to the handler.
        Wrap the actual emission of the record with acquisition/release of
        the I/O thread lock. Returns whether the filter passed the record for
        emission.
        """
        rv = self.filter(record)
        if rv:
            self.acquire()
            try:
                self.emit(record)
            finally:
                self.release()
        return rv
    ................

As you can see, the lock is taken for every log message. The Handler instance is created as a singleton for the application, so a bad thing happens: all the threads block on a single lock and perform an I/O operation while holding it!

What I've done to solve this: I created my own SmartSysLogHandler class, which uses a separate socket per thread to write to /dev/log. It does not contain a lock at all.
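
The class itself is trivial; here is a simplified sketch of the idea (the priority is hard-coded to user.info for brevity, and /dev/log is assumed to be a datagram socket):

import socket
import threading
from logging import Handler

class SmartSysLogHandler(Handler):
    PRIORITY = (1 << 3) | 6  # facility LOG_USER (1), severity INFO (6)

    def __init__(self, address='/dev/log'):
        Handler.__init__(self)
        self.address = address
        self.local = threading.local()  # per-thread storage

    # No lock at all: every thread writes to its own socket
    def createLock(self): pass
    def acquire(self): pass
    def release(self): pass

    def _socket(self):
        # Lazily open one datagram socket per thread
        if not hasattr(self.local, 'sock'):
            sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
            sock.connect(self.address)
            self.local.sock = sock
        return self.local.sock

    def emit(self, record):
        try:
            msg = '<%d>%s\000' % (self.PRIORITY, self.format(record))
            self._socket().send(msg)
        except (KeyboardInterrupt, SystemExit):
            raise
        except:
            self.handleError(record)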

It is strange that such a thing happens in the standard logging library. I guess the standard library is designed to be rock solid, not rocket fast. As Guido van Rossum said, if you don't like threads and the GIL in Python, just spawn several Python processes of your scalable application.

Discussion is welcome.

UnicodeEncodeError in Python logging

I use the Python logging facility to write my logs via syslog. But when I try to log a Unicode message, the emit method somewhere deep inside logging raises UnicodeEncodeError. This happens because it tries to send a Unicode string to a socket.

I googled around and found no solution. File handlers support an encoding parameter, but the other handlers do not.

The simplest way I found to fix this is a custom formatter:

from logging import Formatter

class Utf8LogFormatter(Formatter):
    def format(self, record):
        # Encode to bytes here, so the handler never sees a unicode string
        return Formatter.format(self, record).encode('utf8')
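
Then just set it on a handler instead of the plain Formatter (the format string is arbitrary):

import logging

handler = logging.StreamHandler()  # works for the syslog handlers above, too
handler.setFormatter(Utf8LogFormatter('%(name)s: %(message)s'))

log = logging.getLogger('myapp')
log.addHandler(handler)
log.warning(u'привет')  # no more UnicodeEncodeError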

Is this a Python logging system problem or am I doing something wrong?