Tuesday, November 16, 2010

When to touch swappiness

There is a lot of discussion on the lists about whether or not to touch the /proc/sys/vm/swappiness parameter, and there is no definitive answer. I found a situation where tuning the parameter really improves performance.


On the machine:
  • RAID-1 of three HDDs
  • 12 GB RAM
  • Apache Cassandra instance with 25 GB of data
  • ejabberd instance
The IO is generated by Cassandra, which reads many random data pages and occasionally writes sequential 100-200 MB chunks of data. Some additional IO comes from swapping ejabberd memory in and out.

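So most of the write load comes from swapping out random ejabberd memory pages. And we know that RAID-1 is much better at reads than at writes: a read can be served by any one of the N mirrored disks, while every write has to go to all of them. By decreasing the swappiness parameter from 60 to 20 I moved the IO load from writes to reads, because the kernel now prefers to drop page cache rather than swap out process memory. Almost no random swap writes are left.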

The IO load has really decreased. Not a huge optimization, but worth doing.
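
For reference, here is a minimal Python sketch of reading and changing the parameter. The helper names are made up for illustration, and the 0..100 check reflects the range on kernels of that time; treat it as a sketch rather than a tool. Writing to the real /proc file requires root.

```python
from pathlib import Path

# Standard location of the knob on Linux. It is a parameter of the helpers
# below so that they can also be tried against an ordinary file.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current swappiness value as an integer."""
    return int(Path(path).read_text().strip())


def write_swappiness(value, path=SWAPPINESS_PATH):
    """Set swappiness. Needs root when pointed at the real /proc file."""
    # Assumption: accept only 0..100, the range on 2010-era kernels.
    if not 0 <= value <= 100:
        raise ValueError("swappiness is expected to be between 0 and 100")
    Path(path).write_text(f"{value}\n")
```

A value written this way is lost on reboot. To keep it, put `vm.swappiness = 20` into /etc/sysctl.conf.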

Saturday, November 13, 2010

Apache Cassandra experience

In one of my projects I switched from PostgreSQL to Cassandra. There were reasons for the switch.

First. For each user I had to keep an inbox for storing incoming messages and events. What is an inbox? It is a sorted collection of items. Items are accessed using range queries. This caused huge IO overhead on Postgres, because it does not maintain clustered indexes, so the rows of one inbox end up scattered across the table. All "tables" in Cassandra are clustered, because they are stored as SSTables (sorted string tables).
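
To illustrate why sorted storage suits an inbox, here is a toy in-memory sketch. It is not the Cassandra or Telephus API; the `Inbox` class and its methods are hypothetical names, and integer timestamps are assumed as sort keys. The point is only that a range query over sorted data comes back as one contiguous slice.

```python
import bisect


class Inbox:
    """Toy in-memory inbox: items are kept sorted by timestamp."""

    def __init__(self):
        self._keys = []   # sorted timestamps
        self._items = []  # payloads, in the same order as _keys

    def add(self, timestamp, item):
        # Insert after any existing items with the same timestamp.
        pos = bisect.bisect_right(self._keys, timestamp)
        self._keys.insert(pos, timestamp)
        self._items.insert(pos, item)

    def range(self, start, end):
        """Return items with start <= timestamp < end as one slice."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return self._items[lo:hi]


inbox = Inbox()
inbox.add(30, "event c")
inbox.add(10, "message a")
inbox.add(20, "message b")
recent = inbox.range(10, 30)  # the items stamped 10 and 20, in order
```

On disk the same property means a range read touches neighbouring pages instead of one random page per item.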

Second. My application had huge write throughput. Postgres is good at writes, with its write-ahead log and the absence of table locks on write. But even after write-oriented optimizations it still was not enough. Cassandra's write path is completely different, and it suits my needs better.

Third. The application servers are Python Twisted applications. There is one Postgres binding for Twisted, and it is abandoned and buggy. The Cassandra API is available via Thrift, which in turn supports Twisted. I recommend the great Telephus wrapper for Thrift and Twisted.

On Cassandra's IRC channel people tell each other about their Cassandra clusters. I look a bit stupid when I say I have a single node. But who cares? If it works better than Postgres for me, why not?

Disclaimer: I am not saying that Cassandra is better than PostgreSQL. It just suits this particular application better. I use PostgreSQL a lot in many other projects.

Tuesday, November 9, 2010

Google AppEngine Experience

At first glance, AppEngine is really nice with all that cloud computing. Pay only for what you use. Scale indefinitely. Of course, you have some limitations, like a custom (Python or Java) environment with predefined APIs. But the APIs are really good and mostly sufficient.

At second glance, AppEngine is really, really nice! You'll find a great toolset in the SDK and the application management console. Version management, quota settings, convenient shell scripts in the SDK for deployment and testing. Also log managers, a kind of simple profiler, etc. I can't imagine how much effort was spent on the toolset.

At third glance you'll find AppEngine unusable.

  • Two years after release, there are still unexpected errors in the management console. Sometimes I cannot get into it for hours.
  • When you need to delete a table from the datastore, cross your fingers. Sometimes a table becomes corrupted and you cannot delete it. Only recreating the application helps.
  • AppEngine pricing claims 10 cents per CPU hour. Good. But you have to use the CPU through the API. When I tried to upload my 1 GB database to AppEngine, it took some hours of real time and some days of AppEngine CPU time. It cost me about $60 just to upload my database! I have to admit, this is the hard part. But PostgreSQL moves this database back and forth in minutes!
  • Finally, I managed to port my application and upload all the data. But the cost per pageview is tremendous. It would cost me hundreds of dollars a month instead of my current inexpensive dedicated server (which is about 10% busy at peak).
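
As a rough check of the upload bill mentioned above, the arithmetic can be sketched in Python. It assumes that the whole $60 was billed as CPU time at the flat 10 cents per hour rate, which may not be exactly how the bill was composed. Amounts are kept in integer cents to avoid floating-point rounding.

```python
PRICE_PER_CPU_HOUR_CENTS = 10  # the "10 cents per CPU hour" from the pricing
UPLOAD_COST_CENTS = 6000       # the roughly $60 upload bill


def implied_cpu_hours(cost_cents, price_per_hour_cents):
    """CPU hours that a bill corresponds to at a flat hourly price."""
    return cost_cents / price_per_hour_cents


cpu_hours = implied_cpu_hours(UPLOAD_COST_CENTS, PRICE_PER_CPU_HOUR_CENTS)
cpu_days = cpu_hours / 24
print(f"{cpu_hours:.0f} CPU hours, about {cpu_days:.0f} CPU days")
```

Under that assumption the upload of a 1 GB database was billed as about 600 CPU hours, or 25 CPU days.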