A Case Study of Scalability Related “Out of memory” Crash in Erlang

Caoyuan Blog - - December 31, 2008

We are building a platform for message switching, in Erlang. Everything looks OK on stability and features. It actually have run more than half year with zero down. We tested its performance on our 2-core CPU machine before, and got about 140 transactions/second, it’s good enough.

Then, we got a 8-core CPU machine several weeks ago, and we did same performance testing on it, to see the scalability. Since Erlang is almost perfect on scalability, you can image the result, yes, about 700 transactions/second now, scaled almost linear. Until it crashed with “out of memory” when million hits processed.

It left a very big “erl_crash.dump” file there, I had to dig the issue now. My first guess was, were some remote requests (access db, access remote web service etc) timeout but the process itself was not timeout yet, and cause more and more processes kept in VM?

A quick grep “=proc:” erl_crash.dump showed that the total number of processes was about 900, which was resonable for our case.

So, which process ate so many memory? A quick grep “Stack+head” erl_crash.dump showed that there was indeed a process with 285082125 size of Stack+head there.

Following this clue, I caught this process:

=proc:<0.4.0>
State: Garbing
Name: error_logger
Spawned as: proc_lib:init_p/5
Last scheduled in for: io_lib_format:pad_char/2
Spawned by: <0.1.0>
Started: Sun Apr 1 01:21:50 2012
Message queue length: 2086029
Number of heap fragments: 1234053
Heap fragment data: 281266956
Link list: [<0.27.0>, <0.0.0>, {from,<0.42.0>,#Ref<0.0.0.88>}]
Reductions: 72745575
Stack+heap: 285082125
OldHeap: 47828850
Heap unused: 121777661
OldHeap unused: 47828850
Program counter: 0x0764c66c (io_lib_format:pad_char/2 + 4)
CP: 0x0764c1b4 (io_lib_format:collect_cseq/2 + 124)

This bad guy was error_logger, which is from OTP/Erlang standard lib: error_log, writing received messages to log file or tty. The typical usage is:

error_logger:info_msg("~p:~p " ++ Format, [?MODULE, ?LINE] ++ Data))

Which will format Data to a String according to the Format string, and write it to tty or log file.

The above case showed the message queue length of process “error_logger” had reached 1234053, and the Stack+heap is 285082125, about 272M size.

So the cause may be, that the message queue could not be processed in time, the messages were crowded in error_logger’s process and finally caused “out of memory”. The bottle-neck was that when error_logger tried to format the message to String, Erlang VM was weak on processing them, which seemed to need a lot of CPU cycles.

In my previous blog, I talked about Erlang is bad on massive text processing. Erlang processes String/Text via List, which is obvious bottle-neck in Erlang now, with Erlang is getting much and much popular and more and more Erlang applications are written.

But, why this did not happen on our 2-core CPU machine? It’s an interesting scalability related problem:

“error_logger” module will registered one and only one process to receive and handle all log messages. But Erlang VM’s scheduler can not distribute ONE process to use multiple CPUs’ computing ability. In our 2-core machine, the whole ability is about 140 transactions/second, the one process of “error_logger” just happened to have the power to handle corresponding log messages in time. Under 8-core CPUs machine, our platform scales to handle 700 transactions/second, but there still only one process of “error_logger”, which can not use 8-core CPUs’ ability at all, and finally fail on it.

BTW, I think error_logger should cut its message queue when can not process them in time (disk IO may also be slower than receiving messages).



Categories: Blogs  Caoyuan Blog  

Comments

anonymous avatar

If the number of messages to error_logger increases linearly with your systems capacity it seems to me that you are logging even successful cases to error_logger. I think the purpose with error_logger is to log errors or infrequent strange things.

I however admit that there is room for improvements in the error_logger implementation as well.

/Kenneth

Posted by Kenneth on 01 Jan 2009 at 11:32



 
Mickaël Rémond's avatar

Kenneth,

We (at ProcessOne) have an improved version of the error logger that we can esnd you if you wish to check if it is helpful for Erlang / OTP team.

Posted by Mickaël Rémond on 04 Jan 2009 at 19:24



 
anonymous avatar

Kenneth,

Yes, I should use “disk_log” module. But I think in the blog I meant some interesting scalability issues that I never minded before.

Posted by Caoyuan on 08 Jan 2009 at 21:53



 
anonymous avatar

I believe the memory sizes are measured in words, i.e. a stack+heap size of 285082125 means 1087MB, not 272MB.

Posted by Erik Søe Sørensen on 17 Aug 2009 at 13:26



 


Add comment

Name:

Email:

URL:

Smileys

Remember my personal information

Notify me of follow-up comments?