A Case Study of Scalability Related “Out of memory” Crash in Erlang
Caoyuan Blog - - December 31, 2008We are building a platform for message switching, in Erlang. Everything looks OK on stability and features. It actually have run more than half year with zero down. We tested its performance on our 2-core CPU machine before, and got about 140 transactions/second, it’s good enough.
Then, we got a 8-core CPU machine several weeks ago, and we did same performance testing on it, to see the scalability. Since Erlang is almost perfect on scalability, you can image the result, yes, about 700 transactions/second now, scaled almost linear. Until it crashed with “out of memory” when million hits processed.
It left a very big “erl_crash.dump” file there, I had to dig the issue now. My first guess was, were some remote requests (access db, access remote web service etc) timeout but the process itself was not timeout yet, and cause more and more processes kept in VM?
A quick grep “=proc:” erl_crash.dump showed that the total number of processes was about 900, which was resonable for our case.
So, which process ate so many memory? A quick grep “Stack+head” erl_crash.dump showed that there was indeed a process with 285082125 size of Stack+head there.
Following this clue, I caught this process:
=proc:<0.4.0>
State: Garbing
Name: error_logger
Spawned as: proc_lib:init_p/5
Last scheduled in for: io_lib_format:pad_char/2
Spawned by: <0.1.0>
Started: Sun Apr 1 01:21:50 2012
Message queue length: 2086029
Number of heap fragments: 1234053
Heap fragment data: 281266956
Link list: [<0.27.0>, <0.0.0>, {from,<0.42.0>,#Ref<0.0.0.88>}]
Reductions: 72745575
Stack+heap: 285082125
OldHeap: 47828850
Heap unused: 121777661
OldHeap unused: 47828850
Program counter: 0x0764c66c (io_lib_format:pad_char/2 + 4)
CP: 0x0764c1b4 (io_lib_format:collect_cseq/2 + 124)
This bad guy was error_logger, which is from OTP/Erlang standard lib: error_log, writing received messages to log file or tty. The typical usage is:
error_logger:info_msg("~p:~p " ++ Format, [?MODULE, ?LINE] ++ Data))
Which will format Data to a String according to the Format string, and write it to tty or log file.
The above case showed the message queue length of process “error_logger” had reached 1234053, and the Stack+heap is 285082125, about 272M size.
So the cause may be, that the message queue could not be processed in time, the messages were crowded in error_logger’s process and finally caused “out of memory”. The bottle-neck was that when error_logger tried to format the message to String, Erlang VM was weak on processing them, which seemed to need a lot of CPU cycles.
In my previous blog, I talked about Erlang is bad on massive text processing. Erlang processes String/Text via List, which is obvious bottle-neck in Erlang now, with Erlang is getting much and much popular and more and more Erlang applications are written.
But, why this did not happen on our 2-core CPU machine? It’s an interesting scalability related problem:
“error_logger” module will registered one and only one process to receive and handle all log messages. But Erlang VM’s scheduler can not distribute ONE process to use multiple CPUs’ computing ability. In our 2-core machine, the whole ability is about 140 transactions/second, the one process of “error_logger” just happened to have the power to handle corresponding log messages in time. Under 8-core CPUs machine, our platform scales to handle 700 transactions/second, but there still only one process of “error_logger”, which can not use 8-core CPUs’ ability at all, and finally fail on it.
BTW, I think error_logger should cut its message queue when can not process them in time (disk IO may also be slower than receiving messages).
Categories: Blogs Caoyuan Blog
Comments
If the number of messages to error_logger increases linearly with your systems capacity it seems to me that you are logging even successful cases to error_logger. I think the purpose with error_logger is to log errors or infrequent strange things.
I however admit that there is room for improvements in the error_logger implementation as well.
/Kenneth
Posted by Kenneth on 01 Jan 2009 at 11:32Kenneth,
We (at ProcessOne) have an improved version of the error logger that we can esnd you if you wish to check if it is helpful for Erlang / OTP team.
Posted by Mickaël Rémond on 04 Jan 2009 at 19:24Kenneth,
Yes, I should use “disk_log” module. But I think in the blog I meant some interesting scalability issues that I never minded before.
Posted by Caoyuan on 08 Jan 2009 at 21:53I believe the memory sizes are measured in words, i.e. a stack+heap size of 285082125 means 1087MB, not 272MB.
Posted by Erik Søe Sørensen on 17 Aug 2009 at 13:26I am accustomed with the abstraction you are apropos to, pass4sure 650-568 but annihilation in that abstraction shows an access by added than 200% via assorted optimization, pass4sure 642-524 which is what happens on OSX for instance application my queuing mechanism. pass4sure 642-504 btw i already am application atom polling and such.
Posted by Neon21 on 11 Feb 2011 at 07:39My work uses Erlang to help organize our output of 3D objects made from plastics. It helps me a lot with the rapid prototype work I am personally responsible for. I haven’t ran into the “out of memory” error but will keep an eye out for it now. It took some getting use to Erlang since I come from a C+ background, but it’s a nice code nonetheless.
Posted by JD on 04 Mar 2011 at 04:34All scales have a weight capacity, some are as small as a kitchen scale measuring grams to scales big enough to measure a truck.
livestock scales
Overland has given their crash dump write up a face lift so if you are needing fast server support with comprehensive write ups they have very detailed resources to look up to.
George
IT Support
Add comment
Erlang on Twitter
» despenjahatdos (Jon champion): Eits jangan salah begini2 saya titisan dewa erlang RT @yolapitalokaa: Yg ngepost twit kyknya jg lg galau drtd ... http://t.co/QfCyVSIl
» erlangtriaji (erlang triaji ): Sini sun ahahaha RT @Encays: Udah udah, lo berduaan aja RT @revianh: Kepooo! RT @erlangtriaji: Hadir RT @Encays: Udah, sama erlang aj
» Encays (antarif cahyadi): Menjepit RT @erlangtriaji: Tegang! RT @revianh: Kepooo! RT @erlangtriaji: Hadir RT @Encays: Udah, sama erlang aja RT @revianh
» erlangtriaji (erlang triaji ): Tegang! RT @revianh: Kepooo! RT @erlangtriaji: Hadir RT @Encays: Udah, sama erlang aja RT @revianh: Nanggepnya lama banget
» Encays (antarif cahyadi): Udah udah, lo berduaan aja RT @revianh: Kepooo! RT @erlangtriaji: Hadir RT @Encays: Udah, sama erlang aja RT @revianh: Nanggepnya lama
» revianh (Revian Hermansyah): Kepooo! RT @erlangtriaji: Hadir RT @Encays: Udah, sama erlang aja RT @revianh: Nanggepnya lama banget -_-
» erlangtriaji (erlang triaji ): Hadir RT @Encays: Udah, sama erlang aja RT @revianh: Nanggepnya lama banget -_-
» Encays (antarif cahyadi): Udah, sama erlang aja RT @revianh: Nanggepnya lama banget -_-
» mshiba64 (Masami Shibatani): Erlangではシリアライズはterm_to_binaryというBuilt-in-functionで実行される。画像データもErlangで扱われるいくつかのTermも全てBinary型に可逆変換できる。
» tomohikoseven (tomohiko nagase): 更新した。|andreのブログ: Erlang avl tree insert を作った : http://t.co/4uBqenSw
Statistics
Number of aggregated posts: 10454
Number of comments: 1392
Most recent article: January 31, 2012
Latest comments
» nobelboy on OpaDo Data Storage: Feel free to add some Qs here or contact me offline, and I will see what I can work into…
» darrensy on The Twisted Matrix: This has been a great idea you have shared. covers for kindle
» jony on Principle Software Engineer at LonoCloud (Full-time): That provides will become a internet marketer of little kinds of expert methods developers developing strategy using Erlang/OTP. There will…