Know Your Tool
Jan Lehnardt - jan@apache.org (Jan) - July 22, 2008
The RarestNews developer is considering InnoDB and CouchDB for a re-architecture of his high-volume news site. He did his homework researching, but I couldn’t help commenting on a few things he wrote. The comment turned into a blog post, and since this is my blog, it should be posted here as well.
I am specifically referring to the paragraphs about InnoDB and CouchDB:
MySQL problems
So, to be technical here I’ve used MyISAM tables (never really liked InnoDB because of it’s slow writes and at 100k new articles a day with lots of meta-data to write about them, like tags, dates, snippets, word frequencies, etc) - it seemed like a good decision. The bad part was that on write MyISAM locks the whole table. So 50 bots scouring the Web for news writing and locking whole table made site almost unresponsive.
I’m not yet sure how to solve it - with InnoDB, with PostgreSQL or with some kind of new-age databases like CouchDB, StrokeDB, maybe Amazon’s SimpleDB, etc…
CouchDB problems
They seem like a nice idea when you read about them, but… there are flaws.. The main problem with CouchDB for example is it’s complete HDD-dependence. Modern memory is hundreds of times faster than DB, so you’re using only 1% of speed if you use HDD-based database. And the second problem is it’s “Do not overwrite” motto. It doesn’t reuse space no longer needed, so if I write a 100KB article to database (along with some other data and then I rewrite this entry - there’s now 200KB stored on my drive) and each update eats 100KB more.
How to avoid it? Compact the database, so it creates a NEW file with only the latter 100KB. And delete the previous database file. So, even I didn’t change anything - I’ve had to write the same data 3 times (along with all of my database in compaction process). What that means.
1) It’s AT LEAST 3 time slower than your HDD speed if you want to effectively use ALL of your hard drive, so now we have only 0.3% of computer speed (compared to memory usage).
2) You can only use databases of size of HALF of your HDD (but in reality more like 33%) to effectively use CouchDB (remember - compaction process creates NEW file, so it needs at least same amount of space as it uses).
I replied:
Heya,
I just want to put in perspective that CouchDB is still in alpha stage and no performance work has been done. Expect the HDD dependency to be less of a problem. In the meantime, a caching HTTP proxy will do the trick for you.
The “do not overwrite” approach is a design choice with the consequences you correctly outline. But “effectively using your hard drive” might not mean “use the least amount of space at all times”. It is more like: don’t talk to the drive if you can avoid it, and make as few seeks as possible. That is what CouchDB is designed to do, at the expense of deferring another write operation to off-peak times with compaction. Compaction takes advantage of writing in bulk, which is just flushing data to disk without random seeks. So your equation is close, but not exact.
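To make the space arithmetic concrete, here is a minimal sketch of an append-only store. It is a toy model and not CouchDB's actual file format: the class name `AppendOnlyStore` and its methods are made up for illustration, and sizes are plain numbers in KB rather than real disk writes.

```python
class AppendOnlyStore:
    """Toy model of an append-only database file (hypothetical, not CouchDB's format)."""

    def __init__(self):
        self.log = []   # every version ever written, in order: (key, size_kb)
        self.live = {}  # key -> index in self.log of the current version

    def write(self, key, size_kb):
        # An update never overwrites in place; it appends a new version.
        self.log.append((key, size_kb))
        self.live[key] = len(self.log) - 1

    def file_size_kb(self):
        # The file holds all versions, including superseded ones.
        return sum(size for _, size in self.log)

    def compact(self):
        # Copy only the live versions into a new store, in one sequential pass.
        fresh = AppendOnlyStore()
        for key, idx in sorted(self.live.items(), key=lambda kv: kv[1]):
            fresh.write(key, self.log[idx][1])
        return fresh


store = AppendOnlyStore()
store.write("article-1", 100)  # the initial 100KB article
store.write("article-1", 100)  # a rewrite: the old version stays in the file

size_before = store.file_size_kb()
compacted = store.compact()
size_after = compacted.file_size_kb()

# While compaction runs, the old file and the new file exist side by side.
peak_during_compaction = size_before + size_after
```

In this model the file grows to 200KB after one rewrite, compaction brings it back to 100KB, and the peak footprint while both files exist is 300KB. The point of the design is that the extra copy is a single sequential write that can be scheduled for off-peak times.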
You can use the bulk insertion feature yourself to get less fragmentation in the first place when your crawlers dump their data. This is also fairly fast (3 seeks and a flush of data).
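As a rough sketch of what a crawler-side bulk insert could look like, the snippet below builds a single request for CouchDB's `_bulk_docs` HTTP endpoint. The database URL, the `news` database name, the document fields, and the helper name `bulk_docs_request` are all assumptions for illustration, and the exact API of the alpha release discussed here may differ.

```python
import json


def bulk_docs_request(db_url, docs):
    """Build (url, headers, body) for one bulk insert of all docs.

    Hypothetical helper: it only prepares the request, it does not send it.
    """
    url = db_url.rstrip("/") + "/_bulk_docs"
    headers = {"Content-Type": "application/json"}
    # All documents travel in one JSON body under the "docs" key.
    body = json.dumps({"docs": docs}).encode("utf-8")
    return url, headers, body


# Example documents a crawler might have collected (made-up fields).
articles = [
    {"_id": "article-1", "title": "First headline", "tags": ["news"]},
    {"_id": "article-2", "title": "Second headline", "tags": ["news", "tech"]},
]

# Assumed local CouchDB on its default port with a database named "news".
url, headers, body = bulk_docs_request("http://localhost:5984/news", articles)
```

Sending it is one POST, for example with `urllib.request.Request(url, data=body, headers=headers, method="POST")`. Batching a crawler's results this way means one request and one flush per batch instead of one per article.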
To be frank, I don’t see a crawled news item changing that often. But then, I don’t know what you are doing with it at RarestNews.
Also, InnoDB is not the worst choice. It puts data integrity before speed (which CouchDB does as well), so it will always be slower than MyISAM, which just doesn’t care about integrity. InnoDB works hard to hit the HDD as infrequently as possible and, when it has to, to read and write in batches.
The difference between InnoDB and CouchDB is that with CouchDB’s compaction you can control when some of the work gets done, whereas InnoDB’s mechanisms add to the current load of a system. So CouchDB actually lets you make smart use of your resources.
I’d like to recommend Theo Schlossnagle’s Scalable Internet Architectures.
Among a plethora of useful information, it discusses the design of a system similar to the one you are describing.
Cheers
Jan
—
PS: I work with MySQL at my day job and on CouchDB in my free time, so I am obviously biased both ways.
PPS: If you have any questions regarding any of the above, feel free to contact me.