A Case Study of Scalability Related “Out of memory” Crash in Erlang
Caoyuan Blog - - December 31, 2008We are building a platform for message switching, in Erlang. Everything looks OK on stability and features. It actually have run more than half year with zero down. We tested its performance on our 2-core CPU machine before, and got about 140 transactions/second, it’s good enough.
Then, we got a 8-core CPU machine several weeks ago, and we did same performance testing on it, to see the scalability. Since Erlang is almost perfect on scalability, you can image the result, yes, about 700 transactions/second now, scaled almost linear. Until it crashed with “out of memory” when million hits processed.
It left a very big “erl_crash.dump” file there, I had to dig the issue now. My first guess was, were some remote requests (access db, access remote web service etc) timeout but the process itself was not timeout yet, and cause more and more processes kept in VM?
A quick grep “=proc:” erl_crash.dump showed that the total number of processes was about 900, which was resonable for our case.
So, which process ate so many memory? A quick grep “Stack+head” erl_crash.dump showed that there was indeed a process with 285082125 size of Stack+head there.
Following this clue, I caught this process:
=proc:<0.4.0>
State: Garbing
Name: error_logger
Spawned as: proc_lib:init_p/5
Last scheduled in for: io_lib_format:pad_char/2
Spawned by: <0.1.0>
Started: Sun Apr 1 01:21:50 2012
Message queue length: 2086029
Number of heap fragments: 1234053
Heap fragment data: 281266956
Link list: [<0.27.0>, <0.0.0>, {from,<0.42.0>,#Ref<0.0.0.88>}]
Reductions: 72745575
Stack+heap: 285082125
OldHeap: 47828850
Heap unused: 121777661
OldHeap unused: 47828850
Program counter: 0x0764c66c (io_lib_format:pad_char/2 + 4)
CP: 0x0764c1b4 (io_lib_format:collect_cseq/2 + 124)
This bad guy was error_logger, which is from OTP/Erlang standard lib: error_log, writing received messages to log file or tty. The typical usage is:
error_logger:info_msg("~p:~p " ++ Format, [?MODULE, ?LINE] ++ Data))
Which will format Data to a String according to the Format string, and write it to tty or log file.
The above case showed the message queue length of process “error_logger” had reached 1234053, and the Stack+heap is 285082125, about 272M size.
So the cause may be, that the message queue could not be processed in time, the messages were crowded in error_logger’s process and finally caused “out of memory”. The bottle-neck was that when error_logger tried to format the message to String, Erlang VM was weak on processing them, which seemed to need a lot of CPU cycles.
In my previous blog, I talked about Erlang is bad on massive text processing. Erlang processes String/Text via List, which is obvious bottle-neck in Erlang now, with Erlang is getting much and much popular and more and more Erlang applications are written.
But, why this did not happen on our 2-core CPU machine? It’s an interesting scalability related problem:
“error_logger” module will registered one and only one process to receive and handle all log messages. But Erlang VM’s scheduler can not distribute ONE process to use multiple CPUs’ computing ability. In our 2-core machine, the whole ability is about 140 transactions/second, the one process of “error_logger” just happened to have the power to handle corresponding log messages in time. Under 8-core CPUs machine, our platform scales to handle 700 transactions/second, but there still only one process of “error_logger”, which can not use 8-core CPUs’ ability at all, and finally fail on it.
BTW, I think error_logger should cut its message queue when can not process them in time (disk IO may also be slower than receiving messages).
Categories: Blogs Caoyuan Blog
Comments
If the number of messages to error_logger increases linearly with your systems capacity it seems to me that you are logging even successful cases to error_logger. I think the purpose with error_logger is to log errors or infrequent strange things.
I however admit that there is room for improvements in the error_logger implementation as well.
/Kenneth
Posted by Kenneth on 01 Jan 2009 at 11:32Kenneth,
We (at ProcessOne) have an improved version of the error logger that we can esnd you if you wish to check if it is helpful for Erlang / OTP team.
Posted by Mickaël Rémond on 04 Jan 2009 at 19:24Kenneth,
Yes, I should use “disk_log” module. But I think in the blog I meant some interesting scalability issues that I never minded before.
Posted by Caoyuan on 08 Jan 2009 at 21:53I believe the memory sizes are measured in words, i.e. a stack+heap size of 285082125 means 1087MB, not 272MB.
Posted by Erik Søe Sørensen on 17 Aug 2009 at 13:26
Add comment
Erlang on Twitter
» fedorausers (Fedora Linux Users): #linux #fedora Re: erlang-doc - dubious dependencies http://dlvr.it/4cWnw
» bestform (bestform): @danielefrijia lisp und erlang habe ich beides schon beruflich eingesetzt.
» bestform (bestform): Es gibt übrigens durchaus Antworten, die ich akzeptieren würde. Lisp, Erlang, Clojure, von mir aus auch Scala. Na? Wie sieht’s aus? :)
» charpi (Nicolas Charpentier): RT @pavlobaron: Even if #erlang hasn’t been mentioned in the latest #thoughtworks report, it won’t keep us from building real cool things with it
» chrisumbel (chris umbel): the functional work i’ve done in the last 18 months (erlang & clojure) have clearly changed how i write Java & .Net code 4 the good
» grzegorzkazulak (Grzegorz Kazulak): @strzalekk erlang? nice :)
» aprimc (Andrej Primc): Reinventing the wheel. No agreeable template engine in Erlang.
» fedorausers (Fedora Linux Users): #linux #fedora erlang-doc - dubious dependencies http://dlvr.it/4cNwW
» opencrowd (OpenCrowd): Cloudant’s BigCouch is open-source. BigCouch is a set of Erlang/OTP applications for creating a cluster of CouchDBs http://bit.ly/bILJ8p
» Missing_Faktor (Rahul Goma Phulore): RT @pavlobaron: Even if #erlang hasn’t been mentioned in the latest #thoughtworks report, it won’t keep us from building real cool things with it
Statistics
Number of aggregated posts: 10079
Number of comments: 554
Most recent article: September 01, 2010
Latest comments
» Nissan Frontier Superchager on Erlang Doesn’t Fit The JVM: I don’t believe it is the silver bullet that fixes all the problems that required you to do your JVM tuning….Nissan Frontier Superchager
» Nissan Frontier Superchager on What to do About Erlang's Records?: The general solution is to delete all the keys that should have new values, then insert the new key/value pairs…
» videomob on Java And Threads (Jetty): Хватит спамить, накинулись