<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Linux OOM Killer</title>
	<atom:link href="http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/feed/" rel="self" type="application/rss+xml" />
	<link>http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/</link>
	<description>Ideas on Databases, Logic, and Language by Jeff Davis</description>
	<lastBuildDate>Thu, 11 Mar 2010 04:54:05 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Lou Gosselin</title>
		<link>http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-177</link>
		<dc:creator>Lou Gosselin</dc:creator>
		<pubDate>Fri, 26 Feb 2010 02:43:23 +0000</pubDate>
		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=200#comment-177</guid>
		<description>Like you, I am completely dumbfounded by the &quot;solution&quot; to this problem.

I am pro-Linux, but that any of the kernel developers defend this OOM mechanism is an embarrassment. It&#039;s so obviously wrong to arbitrarily kill processes by any heuristic. Even if the heuristic were improved, there&#039;s no reason (ever) to be killing off applications which were behaving correctly. Sounds like that was designed by a banker, doesn&#039;t it?

Process badness?? Who are we kidding here, this is pure kernel badness. This is a glaring flaw on an otherwise reliable platform. 

The solution should address the real problems:
1. The system should never over-commit. If there is not enough memory to fulfill a new request, the request should fail *deterministically*, where applications can deal with it.

2. If this prevents a process from forking, then so be it. At least the system remains stable, and the failed request can be handled gracefully.

3. Maybe the fork/exec combination is fundamentally flawed. Even when it works it&#039;s not particularly efficient; this could be an opportunity to improve the system calls.

4. Some feel that over-commit is desirable. I personally disagree that this is acceptable in any production setting, but if it&#039;s necessary then it should be explicit. The kernel must always honor the memory contracts between itself and processes, no exceptions. If an application wants/needs to over-commit RAM it probably won&#039;t use, then let it do so explicitly and at its own risk.

Conceptually a modified fork call could keep the parent safe from overcommit, and allow its child to overcommit until it runs exec; the new executable could do whatever it pleases with regard to overcommit, but by default all processes have the right to expect the kernel will honor its obligations.

I simply cannot believe my ears when I hear Linux kernel hackers debating OOM process killer selection heuristics rather than how to actually fix the problem.</description>
		<content:encoded><![CDATA[<p>Like you, I am completely dumbfounded by the &#8220;solution&#8221; to this problem. </p>
<p>I am pro-Linux, but that any of the kernel developers defend this OOM mechanism is an embarrassment. It&#8217;s so obviously wrong to arbitrarily kill processes by any heuristic. Even if the heuristic were improved, there&#8217;s no reason (ever) to be killing off applications which were behaving correctly. Sounds like that was designed by a banker, doesn&#8217;t it?</p>
<p>Process badness?? Who are we kidding here, this is pure kernel badness. This is a glaring flaw on an otherwise reliable platform. </p>
<p>The solution should address the real problems:<br />
1. The system should never over-commit. If there is not enough memory to fulfill a new request, the request should fail *deterministically*, where applications can deal with it.</p>
<p>2. If this prevents a process from forking, then so be it. At least the system remains stable, and the failed request can be handled gracefully.</p>
<p>3. Maybe the fork/exec combination is fundamentally flawed. Even when it works it&#8217;s not particularly efficient; this could be an opportunity to improve the system calls.</p>
<p>4. Some feel that over-commit is desirable. I personally disagree that this is acceptable in any production setting, but if it&#8217;s necessary then it should be explicit. The kernel must always honor the memory contracts between itself and processes, no exceptions. If an application wants/needs to over-commit RAM it probably won&#8217;t use, then let it do so explicitly and at its own risk.</p>
<p>Conceptually a modified fork call could keep the parent safe from overcommit, and allow its child to overcommit until it runs exec; the new executable could do whatever it pleases with regard to overcommit, but by default all processes have the right to expect the kernel will honor its obligations. </p>
<p>I simply cannot believe my ears when I hear Linux kernel hackers debating OOM process killer selection heuristics rather than how to actually fix the problem.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Evan P</title>
		<link>http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-164</link>
		<dc:creator>Evan P</dc:creator>
		<pubDate>Tue, 08 Dec 2009 09:15:20 +0000</pubDate>
		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=200#comment-164</guid>
		<description>Sorry, some sloppy thinking on my part. What I said is only true if the application is free()ing and never malloc()ing, and the kernel is off by whatever amortization margin the malloc() implementation uses (3MB for the typical FreeBSD 8.0 configuration, as I mentioned earlier).

Your syscall idea is like an mlock() that reserves either or both of RAM and swap instead of only RAM, or like a MAP_RESERVE flag for Linux&#039;s mmap() to complement the MAP_NORESERVE flag. I think there&#039;s some merit to that suggestion.

The only other thing I&#039;d note is that heap implementations typically give each thread its own arena so as to minimize lock contention on SMP systems. Each of them would need an independent amortized syscall; what I&#039;m getting at is that if this were a workable solution in general, /proc/sys/vm/overcommit_memory==2 would be the default.

Of course, it&#039;s probably a perfectly workable default for people running Postgres ;-)</description>
		<content:encoded><![CDATA[<p>Sorry, some sloppy thinking on my part. What I said is only true if the application is free()ing and never malloc()ing, and the kernel is off by whatever amortization margin the malloc() implementation uses (3MB for the typical FreeBSD 8.0 configuration, as I mentioned earlier).</p>
<p>Your syscall idea is like an mlock() that reserves either or both of RAM and swap instead of only RAM, or like a MAP_RESERVE flag for Linux&#8217;s mmap() to complement the MAP_NORESERVE flag. I think there&#8217;s some merit to that suggestion.</p>
<p>The only other thing I&#8217;d note is that heap implementations typically give each thread its own arena so as to minimize lock contention on SMP systems. Each of them would need an independent amortized syscall; what I&#8217;m getting at is that if this were a workable solution in general, /proc/sys/vm/overcommit_memory==2 would be the default.</p>
<p>Of course, it&#8217;s probably a perfectly workable default for people running Postgres <img src='http://thoughts.j-davis.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
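<p>For readers unfamiliar with the setting mentioned above, here is a minimal Python sketch (an illustration, not part of the original comment) that reads /proc/sys/vm/overcommit_memory and reports what the value means. The helper names describe_overcommit_mode and read_overcommit_mode are hypothetical; the three modes follow the Linux kernel&#8217;s overcommit accounting documentation.</p>
<pre>
```python
from pathlib import Path

# Meanings of vm.overcommit_memory as described in the Linux kernel's
# overcommit accounting documentation.
MODES = {
    0: "heuristic overcommit (the usual default)",
    1: "always overcommit, never refuse an allocation",
    2: "strict accounting, refuse requests beyond the commit limit",
}


def describe_overcommit_mode(raw):
    """Turn the text read from the sysctl file into a description (hypothetical helper)."""
    value = int(raw.strip())
    return MODES.get(value, "unknown mode %d" % value)


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Describe the current mode, or return None where the file is absent (e.g. non-Linux)."""
    p = Path(path)
    if not p.exists():
        return None
    return describe_overcommit_mode(p.read_text())


if __name__ == "__main__":
    print(read_overcommit_mode())
```
</pre>
<p>On a machine configured the way the comment suggests for Postgres, this should report the strict accounting mode.</p>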
]]></content:encoded>
	</item>
	<item>
		<title>By: Pádraig Brady</title>
		<link>http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-162</link>
		<dc:creator>Pádraig Brady</dc:creator>
		<pubDate>Mon, 07 Dec 2009 14:11:10 +0000</pubDate>
		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=200#comment-162</guid>
		<description>If FreeBSD handles it fine then that&#039;s cool (albeit beside the point). PostgreSQL will exit or is in a detectable bad state and can be restarted.</description>
		<content:encoded><![CDATA[<p>If FreeBSD handles it fine then that&#8217;s cool (albeit beside the point). PostgreSQL will exit or is in a detectable bad state and can be restarted.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Log Buffer</title>
		<link>http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-161</link>
		<dc:creator>Log Buffer</dc:creator>
		<pubDate>Fri, 04 Dec 2009 20:25:06 +0000</pubDate>
		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=200#comment-161</guid>
		<description>&quot;[...]From Jeff Davis’s Experimental Thoughts comes this post on Postgres and the Linux OOM Killer. ... Yes, that would make any Postgres DBA feel kind of ranty.&quot;

&lt;a href=&quot;http://www.pythian.com/news/6165/log-buffer-171-a-carnival-of-the-vanities-for-dbas/&quot; rel=&quot;nofollow&quot;&gt;Log Buffer #171&lt;/a&gt;</description>
		<content:encoded><![CDATA[<p>&#8220;[...]From Jeff Davis’s Experimental Thoughts comes this post on Postgres and the Linux OOM Killer. &#8230; Yes, that would make any Postgres DBA feel kind of ranty.&#8221;</p>
<p><a href="http://www.pythian.com/news/6165/log-buffer-171-a-carnival-of-the-vanities-for-dbas/" rel="nofollow">Log Buffer #171</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jeff Davis</title>
		<link>http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-160</link>
		<dc:creator>Jeff Davis</dc:creator>
		<pubDate>Wed, 02 Dec 2009 19:09:49 +0000</pubDate>
		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=200#comment-160</guid>
		<description>&quot;kernel already knows exactly how much memory the application is using&quot;

Knows how much the application is using, or knows how much malloc has told the application that it can use?

I thought the problem was VM space versus actually allocated memory.

If, in addition to allocating VM space, malloc explicitly asked the OS to reserve memory (say, every 1024 pages or so to amortize the cost of a syscall) before returning a valid pointer, then the OS would have an opportunity to say &quot;there&#039;s no way I can reserve those pages for you&quot;, and malloc could return NULL.

I understand there are still COW issues with fork and dlopen, but I&#039;m not looking for a perfect solution. In those cases the OS has a pretty good idea how much it has extended itself (number of processes sharing that page that might try to write).

The main problem is with malloc, where (as you say) there is such a huge disconnect between the VM size and the memory that the application has requested that the OS doesn&#039;t know what kind of trouble it&#039;s in early enough, and even if it does, malloc has no way of knowing the OS may not be able to satisfy the requests.</description>
		<content:encoded><![CDATA[<p>&#8220;kernel already knows exactly how much memory the application is using&#8221;</p>
<p>Knows how much the application is using, or knows how much malloc has told the application that it can use?</p>
<p>I thought the problem was VM space versus actually allocated memory.</p>
<p>If, in addition to allocating VM space, malloc explicitly asked the OS to reserve memory (say, every 1024 pages or so to amortize the cost of a syscall) before returning a valid pointer, then the OS would have an opportunity to say &#8220;there&#8217;s no way I can reserve those pages for you&#8221;, and malloc could return NULL.</p>
<p>I understand there are still COW issues with fork and dlopen, but I&#8217;m not looking for a perfect solution. In those cases the OS has a pretty good idea how much it has extended itself (number of processes sharing that page that might try to write).</p>
<p>The main problem is with malloc, where (as you say) there is such a huge disconnect between the VM size and the memory that the application has requested that the OS doesn&#8217;t know what kind of trouble it&#8217;s in early enough, and even if it does, malloc has no way of knowing the OS may not be able to satisfy the requests.</p>
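<p>The batched-reservation idea above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: ToyKernel, ToyMalloc, and their methods are hypothetical names, no real system call is involved, and the batch size of 1024 pages is simply the figure suggested above.</p>
<pre>
```python
BATCH_PAGES = 1024  # batch size taken from the comment ("every 1024 pages or so")


class ToyKernel:
    """Stand-in for an OS that hands out page reservations and can refuse them."""

    def __init__(self, total_pages):
        self.free_pages = total_pages
        self.reserve_calls = 0  # counts simulated syscalls

    def reserve(self, pages):
        self.reserve_calls += 1
        if pages > self.free_pages:
            return False  # "there's no way I can reserve those pages for you"
        self.free_pages -= pages
        return True


class ToyMalloc:
    """Allocator that only hands out pages it has already reserved."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.reserved_unused = 0

    def malloc_pages(self, pages):
        """Return True on success, or None (standing in for NULL) on refusal."""
        while pages > self.reserved_unused:
            if not self.kernel.reserve(BATCH_PAGES):
                return None
            self.reserved_unused += BATCH_PAGES
        self.reserved_unused -= pages
        return True
```
</pre>
<p>With 2048 pages available, 2048 single-page requests cost only two simulated reservations, and the next request fails up front instead of at a later page fault.</p>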
]]></content:encoded>
	</item>
	<item>
		<title>By: Dan Farina</title>
		<link>http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-159</link>
		<dc:creator>Dan Farina</dc:creator>
		<pubDate>Wed, 02 Dec 2009 18:08:12 +0000</pubDate>
		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=200#comment-159</guid>
		<description>Then the distributors can complain :)  But end users that install Postgres will be happier and less surprised and the region of pain will shrink a little bit to packages.  Sub-ideal, but progress.</description>
		<content:encoded><![CDATA[<p>Then the distributors can complain <img src='http://thoughts.j-davis.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />   But end users that install Postgres will be happier and less surprised and the region of pain will shrink a little bit to packages.  Sub-ideal, but progress.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Evan P</title>
		<link>http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-158</link>
		<dc:creator>Evan P</dc:creator>
		<pubDate>Wed, 02 Dec 2009 08:19:33 +0000</pubDate>
		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=200#comment-158</guid>
		<description>Actually, malloc has no idea how much memory within the heap is in use and won&#039;t page fault: you&#039;re ignoring fork(). In any case the kernel already knows exactly how much memory the application is using within the margin of error maintained by the malloc() implementation, so you don&#039;t need to tell it with a syscall. It just doesn&#039;t have much of an opportunity to act on that information until there&#039;s already a pending page fault it can&#039;t service without killing something first.

Also, don&#039;t forget things like dlopen(), which makes a large shared mapping and then sparsely copy-on-write faults individual pages as it performs relocations: the address space based guess you proposed in (4) would trip up on that, too.</description>
		<content:encoded><![CDATA[<p>Actually, malloc has no idea how much memory within the heap is in use and won&#8217;t page fault: you&#8217;re ignoring fork(). In any case the kernel already knows exactly how much memory the application is using within the margin of error maintained by the malloc() implementation, so you don&#8217;t need to tell it with a syscall. It just doesn&#8217;t have much of an opportunity to act on that information until there&#8217;s already a pending page fault it can&#8217;t service without killing something first.</p>
<p>Also, don&#8217;t forget things like dlopen(), which makes a large shared mapping and then sparsely copy-on-write faults individual pages as it performs relocations: the address space based guess you proposed in (4) would trip up on that, too.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Evan P</title>
		<link>http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-157</link>
		<dc:creator>Evan P</dc:creator>
		<pubDate>Wed, 02 Dec 2009 08:01:09 +0000</pubDate>
		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=200#comment-157</guid>
		<description>In response to (1): I&#039;m sure Postgres does degrade quite well on a system with overcommit disabled, but you&#039;re greatly underestimating how difficult it is to degrade gracefully in an OOM condition when memory is overcommitted. Consider that, by the time an OOM condition is reached, the paging daemon has been awake long enough to give every single memory access attempted by Postgres a non-zero risk of page faulting--code, data, heap, stack, anything. Postgres might squeak by because the memory it needs to touch in order to ROLLBACK is probably last in the pager&#039;s LRU queue, but it does so on dumb luck, with a probability that has nothing to do with its own code&#039;s carefulness and everything to do with what else is happening in the system. The &quot;bizarre situation&quot; is actually the common case, because once the paging daemon awakens, having touched a page in the past gives you little protection.

By the way, calling free() doesn&#039;t actually free memory just as calling malloc() doesn&#039;t actually allocate it. To actually free physical pages, a malloc() implementation must unmap the address space to which they are mapped using sbrk() or munmap(), or must madvise(MADV_FREE) it. Both of these operations are quite expensive, so malloc() implementations amortize the cost. The default configuration of FreeBSD&#039;s jemalloc, for example, unmaps in 1M (contiguous) chunks and lets 2M (discontiguous) accumulate before calling madvise().

In response to (2) I&#039;ll stress again that degrading gracefully after hitting an RLIMIT_* is an entirely different beast than degrading gracefully after the paging daemon has injected non-determinism into your application. And of course it&#039;s worse than that; you can&#039;t even necessarily perform I/O in an OOM condition--the data structures that the syscalls build and submit to the block layer and I/O scheduler and so on obviously must be allocated first.

(I should concede, here, that on a system without swap your arguments are on much stronger ground.)

I don&#039;t address the shared memory counting (3) because I agree with you. Hopefully the work discussed in the LWN article I mentioned, http://lwn.net/Articles/359998/, will go somewhere.

With respect to (4): Yeah, you can make a guess based on allocated VM space, but it would give you a false positive more often than you think. Say you have a system with only 3M or so free and you&#039;re sitting at a shell. Suppose further that you happen to have a pre-compiled &quot;Hello, World&quot; application sitting around. Chances are you wouldn&#039;t be able to run it: the first thing it&#039;s going to do when you run it is mmap() /lib/libc.so, which is probably 2M or so; then it&#039;ll try to call printf(). Even if we ignore stdio&#039;s internal buffers, printf() typically calls malloc() internally while building its string, and the first thing malloc() is going to do is sbrk()/mmap() at least 1M of address space because it&#039;s trying to amortize that cost over lots of malloc()s, and doesn&#039;t realize the program is about to terminate. Boom, allocated VM space exceeds available memory--never mind the fact such a process probably has a peak unshared memory footprint of like 16K.

No system call is cheap on the time scale relevant to a good general purpose malloc implementation, unfortunately.</description>
		<content:encoded><![CDATA[<p>In response to (1): I&#8217;m sure Postgres does degrade quite well on a system with overcommit disabled, but you&#8217;re greatly underestimating how difficult it is to degrade gracefully in an OOM condition when memory is overcommitted. Consider that, by the time an OOM condition is reached, the paging daemon has been awake long enough to give every single memory access attempted by Postgres a non-zero risk of page faulting&#8211;code, data, heap, stack, anything. Postgres might squeak by because the memory it needs to touch in order to ROLLBACK is probably last in the pager&#8217;s LRU queue, but it does so on dumb luck, with a probability that has nothing to do with its own code&#8217;s carefulness and everything to do with what else is happening in the system. The &#8220;bizarre situation&#8221; is actually the common case, because once the paging daemon awakens, having touched a page in the past gives you little protection. </p>
<p>By the way, calling free() doesn&#8217;t actually free memory just as calling malloc() doesn&#8217;t actually allocate it. To actually free physical pages, a malloc() implementation must unmap the address space to which they are mapped using sbrk() or munmap(), or must madvise(MADV_FREE) it. Both of these operations are quite expensive, so malloc() implementations amortize the cost. The default configuration of FreeBSD&#8217;s jemalloc, for example, unmaps in 1M (contiguous) chunks and lets 2M (discontiguous) accumulate before calling madvise().</p>
<p>In response to (2) I&#8217;ll stress again that degrading gracefully after hitting an RLIMIT_* is an entirely different beast than degrading gracefully after the paging daemon has injected non-determinism into your application. And of course it&#8217;s worse than that; you can&#8217;t even necessarily perform I/O in an OOM condition&#8211;the data structures that the syscalls build and submit to the block layer and I/O scheduler and so on obviously must be allocated first.</p>
<p>(I should concede, here, that on a system without swap your arguments are on much stronger ground.)</p>
<p>I don&#8217;t address the shared memory counting (3) because I agree with you. Hopefully the work discussed in the LWN article I mentioned, <a href="http://lwn.net/Articles/359998/" rel="nofollow">http://lwn.net/Articles/359998/</a>, will go somewhere.</p>
<p>With respect to (4): Yeah, you can make a guess based on allocated VM space, but it would give you a false positive more often than you think. Say you have a system with only 3M or so free and you&#8217;re sitting at a shell. Suppose further that you happen to have a pre-compiled &#8220;Hello, World&#8221; application sitting around. Chances are you wouldn&#8217;t be able to run it: the first thing it&#8217;s going to do when you run it is mmap() /lib/libc.so, which is probably 2M or so; then it&#8217;ll try to call printf(). Even if we ignore stdio&#8217;s internal buffers, printf() typically calls malloc() internally while building its string, and the first thing malloc() is going to do is sbrk()/mmap() at least 1M of address space because it&#8217;s trying to amortize that cost over lots of malloc()s, and doesn&#8217;t realize the program is about to terminate. Boom, allocated VM space exceeds available memory&#8211;never mind the fact such a process probably has a peak unshared memory footprint of like 16K.</p>
<p>No system call is cheap on the time scale relevant to a good general purpose malloc implementation, unfortunately.</p>
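<p>The &#8220;Hello, World&#8221; arithmetic in the reply to (4) can be written out as a small Python sketch. The sizes are the rough estimates quoted above, not measurements, and the two admission checks are hypothetical policies invented for illustration; since malloc() maps at least 1M, mapped space reaching the free amount is treated as a refusal.</p>
<pre>
```python
MIB = 1024 * 1024
KIB = 1024

# Figures quoted in the comment; all are rough estimates, not measurements.
FREE_MEMORY = 3 * MIB
MAPPINGS = {
    "libc.so mapping": 2 * MIB,
    "first malloc() arena": 1 * MIB,
}
PEAK_UNSHARED_FOOTPRINT = 16 * KIB


def admit_by_address_space(free_bytes, mappings):
    """Hypothetical strict check: admit only if free memory exceeds the mapped VM space."""
    return free_bytes > sum(mappings.values())


def admit_by_touched_pages(free_bytes, touched_bytes):
    """Hypothetical check against the memory the process actually dirties."""
    return free_bytes > touched_bytes


if __name__ == "__main__":
    print("admitted by address space:", admit_by_address_space(FREE_MEMORY, MAPPINGS))
    print("admitted by touched pages:", admit_by_touched_pages(FREE_MEMORY, PEAK_UNSHARED_FOOTPRINT))
```
</pre>
<p>The address-space check refuses a program whose real footprint is a tiny fraction of free memory, which is the false positive described above.</p>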
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Dunstan</title>
		<link>http://thoughts.j-davis.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-156</link>
		<dc:creator>Andrew Dunstan</dc:creator>
		<pubDate>Wed, 02 Dec 2009 00:52:23 +0000</pubDate>
		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=200#comment-156</guid>
		<description>http://lwn.net/Articles/104185/ contains my favourite explanation of OOM.

I recently turned on strict accounting on a client&#039;s machine to try to minimise the OOM danger. Sadly, while Postgres might be well behaved, other apps are not, and the system very quickly went belly up. Saying &quot;fix your leaky app&quot; doesn&#039;t help much when the app is PHP. I had to turn it off again and hope like hell we don&#039;t have a nasty incident.</description>
		<content:encoded><![CDATA[<p><a href="http://lwn.net/Articles/104185/" rel="nofollow">http://lwn.net/Articles/104185/</a> contains my favourite explanation of OOM.</p>
<p>I recently turned on strict accounting on a client&#8217;s machine to try to minimise the OOM danger. Sadly, while Postgres might be well behaved, other apps are not, and the system very quickly went belly up. Saying &#8220;fix your leaky app&#8221; doesn&#8217;t help much when the app is PHP. I had to turn it off again and hope like hell we don&#8217;t have a nasty incident.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
