Memory not freed?
Michael Bilow
mike at bilow.com
Wed Dec 6 19:22:08 EST 2000
On 2000-11-26 at 13:49 -0500, Derek D. Martin wrote:
> Interesting... I wouldn't have guessed this strategy. Which I suppose
> is at least partly why I'm not a kernel hacker! :)
Copy-on-write is somewhat notorious as it has been patented a number of
times by a number of different parties. See, by way of example, IBM's US
Patent 4,742,450, which covers a technique which was probably in common
use for 15 years or so before the patented method was supposedly invented.
Precisely because copy-on-write is such a clever idea, although
simultaneously discovered by hundreds if not thousands of competent
hardware and software designers at least a decade before anyone first
thought to patent it, those who independently discover it tend to be so
impressed with their own brilliance that a few dozen of them have trotted
right off to their patent lawyers.
> But I was thinking about your comments as pertains to the original
> poster's question about whether or not the OS keeps copies of dead
> applications' pages in memory to speed up future loads...
>
> From a user's perspective, most processes are started by their login
> shell (or a subshell under various circumstances, such as when
> starting X). The shell forks, creates a "copy" of itself which shares
> the same text and data pages, but then execs a different program. At
> that point, I would expect that the data pages of that process become
> independent of the original shell's data pages (except perhaps if it's
> another shell), since the two programs have nothing to do with
> each other. Is that correct?
Child processes have a relationship which exists at a much higher level
than memory paging. From the point of view of the application running in
userland, its segments reside at some sort of address. In some
architectures, these may be arbitrary conventions, such as the text
segment always being numbered 1, the data segment always being numbered 2,
the stack segment always being numbered 3, and so on. Generally, these
segment numbers correspond to some segment register which is maintained by
the CPU hardware, and a large part of a task context consists of somehow
mapping this set of standardized numbers into some sort of global list of
segments maintained from the perspective of kernel mode.
On an Intel system, for example, the CPU has "segment registers" which
have historically been named CS, DS, ES, and SS (for "Code Segment," "Data
Segment," "Extra Segment," and "Stack Segment," respectively) since the
8086 processor generation. When protected mode was introduced with the
80286, the segment registers came to have a slightly different meaning and
the numbers contained in them became "selectors." The 80386 generation
added the FS and GS registers which could be used for optimizations, but
other than this the architecture is pretty much the same. Modifying the
contents of a segment register is an operation subject to privilege, since
a selector is really an index into some sort of "descriptor table" which
is maintained by the operating system to map these selectors into some
globally significant view of memory.
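
To make the "selector is really an index" point concrete, here is a small sketch of how a protected-mode selector value breaks down: the upper 13 bits index the descriptor table, one bit chooses between the global and local table, and the low two bits carry the requested privilege level. The function name and the returned dictionary keys are my own for illustration; only the bit layout comes from the architecture.

```python
def decode_selector(selector):
    """Split a 16-bit x86 protected-mode selector into its three fields."""
    selector &= 0xFFFF                       # a selector is 16 bits wide
    return {
        "index": selector >> 3,              # entry number in the descriptor table
        "table": "LDT" if selector & 0x4 else "GDT",  # table indicator bit
        "rpl": selector & 0x3,               # requested privilege level (0..3)
    }


if __name__ == "__main__":
    # 0x23 is binary 100 0 11: entry 4 of the GDT, privilege level 3.
    print(decode_selector(0x23))
```

The process only ever handles the small number; what entry 4 actually points at is decided by whoever maintains the table, which is the kernel.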
The segment described by each entry in the descriptor table can be either
read-only or read-write for the process. For reasons of efficiency, the
hardware itself is responsible (on an Intel system) for doing all of these
translations and checking to see if an access is allowed. In order to do
this, the CPU has a set of descriptor table registers which are loaded
with the global address of the appropriate descriptor table in memory in
order to manage context changes between processes.
This translation into a globally significant address produces, however,
only a virtual address (a "linear" address in Intel's terminology). The
operating system
is free to map this virtual address into anywhere in physical memory it
can find space, or even to swap the data out to disk and mark the
descriptor as "not present." When physical memory is shared by multiple
instances of the same program as a result of a fork or otherwise, it
happens at this lower level so that different descriptor entries (and
possibly even different local descriptor tables) are created immediately,
even if they simply provide different virtual addresses for the same
physical memory. To add some more complexity, small parts of the segment
mapped from a descriptor, each known as a "page," might be independently
swappable on some architectures (including Intel).
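
As a rough illustration of the page level, this sketch splits a 32-bit address the way classic two-level Intel paging does with 4 KB pages (no PAE or large pages, which is an assumption of the example): ten bits select a page directory entry, ten bits select a page table entry, and twelve bits are the offset within the page. The function name is made up for the example.

```python
def split_linear_address(addr):
    """Split a 32-bit address into (directory index, table index, offset)."""
    addr &= 0xFFFFFFFF
    directory_index = (addr >> 22) & 0x3FF   # top 10 bits
    table_index = (addr >> 12) & 0x3FF       # next 10 bits
    offset = addr & 0xFFF                    # low 12 bits: byte within the 4 KB page
    return directory_index, table_index, offset


if __name__ == "__main__":
    print(split_linear_address(0x08048123))
```

Two addresses which differ only in the offset land in the same page, so they are swapped in and out (and copied) together.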
The end result of all of this is that the processes running in userland
sharing the same physical memory do not see the same addresses into it,
and they are actually unable to know that they are sharing the same
physical memory because they cannot look into the internal tables of the
operating system kernel. It is up to the kernel, when it maintains the
various tables for the hardware, to keep track of what is being shared and
in what manner, and to handle copy-on-write and any other tasks necessary
to preserve the completeness of the illusion.
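
The illusion can be observed from userland even though the mechanism cannot. The following is a minimal sketch, assuming a Unix-like system where os.fork is available: the child overwrites a buffer it inherited, and the parent's view of the same buffer is unchanged. The function name and the buffer contents are invented for the example, and the Python interpreter's own bookkeeping may touch more pages than this single write implies.

```python
import os


def fork_and_modify():
    """Fork, let the child overwrite an inherited buffer, return both views."""
    data = bytearray(b"parent")       # inherited by the child at fork time
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child
        os.close(read_fd)
        data[:] = b"child!"           # the write lands in the child's private copy
        os.write(write_fd, bytes(data))
        os._exit(0)                   # leave without running parent cleanup
    os.close(write_fd)                # parent
    child_view = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(data), child_view


if __name__ == "__main__":
    print(fork_and_modify())
```

Neither process can tell from this whether the memory was copied at fork time or only when the write occurred; that decision belongs to the kernel.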
> So then, does this new process share memory pages with an existing
> copy of the same program? Here, I would think it would share text
> pages, but not data pages, since the state of one is totally different
> from the state of the other... I would intuitively think that it
> wouldn't be worth the effort of mapping the old data pages to the new
> process, since in many cases they will need to be changed immediately.
In fact, you are confusing a lot of very narrow and specific terms. If my
data segment consists of a megabyte or so of data, and I change one byte,
then I might be able to get away with copying only the page which contains
that one byte and modifying the page table accordingly. On an Intel CPU
(under Linux, anyway), a page is 4 KB, so this could save me a great deal
of memory. Some optimizing linkers are actually smart
enough to group chunks of data together in such a way as to minimize page
faults, and we had such a tool for OS/2 which worked in concert with the
LX-format executable. I do not know if anyone has ever developed such an
optimizing linker for Linux ELF.
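
To put rough numbers on the one-byte case, here is a small sketch which counts how many pages a set of writes would force to be copied. The 4 KB page size is the figure given above; the function names and the sample offsets are only illustrative.

```python
PAGE_SIZE = 4096  # 4 KB, the Intel/Linux figure quoted above


def pages_in(segment_bytes, page_size=PAGE_SIZE):
    """Number of pages needed to hold a segment of the given size."""
    return -(-segment_bytes // page_size)    # ceiling division


def pages_to_copy(write_offsets, page_size=PAGE_SIZE):
    """Number of distinct pages touched by writes at the given byte offsets."""
    return len({offset // page_size for offset in write_offsets})


def bytes_copied(write_offsets, page_size=PAGE_SIZE):
    """Bytes duplicated if only the touched pages are copied."""
    return pages_to_copy(write_offsets, page_size) * page_size


if __name__ == "__main__":
    segment = 1 << 20                        # a data segment of one megabyte
    print(pages_in(segment))                 # pages in the whole segment
    print(bytes_copied([123456]))            # cost of changing a single byte
```

For a one-megabyte segment and a single changed byte, that is one page out of 256 copied. It also shows why grouping data matters: writes scattered across many pages cost one copy each, while writes which fall within the same page cost only one.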
> Really, I think there are 3 cases of the fork scenario you describe:
>
> - process forks a copy of itself
> - process forks a new program, which is not already running
> - process forks a new program, which IS already running
>
> I'd be curious what happens in each case. If you care to comment,
> please do!
You are confusing fork and exec. Only a fork really has these kinds of
optimizations available. In the first case where a process actually does
fork, both the text and data segments are simply mapped without being
copied until a write occurs, and then -- depending upon the architecture
-- either the whole data segment or a granular page will be copied.
In the second case, where the child execs a program which is not already
running, new memory is allocated. In the third case, where a process
execs (not forks) a program which is already running, the text segment
will be shared but the data segment will not.
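
The difference between fork and exec can be seen with a short sketch, again assuming a Unix-like system. In the first function the child is a copy of the parent and still sees a value the parent set; in the second the child execs a fresh interpreter, which starts from its own image and knows nothing about that value. The function names, the "marker" variable, and the use of the Python interpreter as the program being exec'd are all choices made for the example.

```python
import os
import sys


def _read_all(fd):
    """Read from a pipe until the writer closes it."""
    with os.fdopen(fd, "rb") as f:
        return f.read().decode().strip()


def fork_only():
    """A plain fork: the child still sees the parent's data."""
    marker = "set-in-parent"
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: same program, same data image
        os.close(r)
        os.write(w, marker.encode())
        os._exit(0)
    os.close(w)
    seen = _read_all(r)
    os.waitpid(pid, 0)
    return seen


def fork_then_exec():
    """Fork followed by exec: the child's old image is discarded."""
    marker = "set-in-parent"          # never visible to the exec'd program
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(r)
            os.dup2(w, 1)             # make the pipe the new program's stdout
            os.execv(sys.executable,
                     [sys.executable, "-c", "print('marker' in dir())"])
        finally:
            os._exit(127)             # only reached if the exec failed
    os.close(w)
    seen = _read_all(r)
    os.waitpid(pid, 0)
    return seen


if __name__ == "__main__":
    print(fork_only(), fork_then_exec())
```

Whether the text pages of the exec'd program end up shared with another running instance is not visible from here; that is decided inside the kernel.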
> P.S. as partial answer to the original poster's question, I believe
> what Linux does to aid in the start-up of recently terminated programs
> is keep the disk blocks containing the program in buffer cache, so long as
> there is enough free memory to do that.
This is generally true, although there is also a more sophisticated
option. For a library, which might be loaded into its own text and data
segments separate from the process which is calling into it, a daemon
could be left running which keeps the segments valid and resident,
awaiting startup of a new instance. This is rarely done on Linux, partly
because it is stupid, but also because Linux has fairly low startup
latency for most reasonable binaries.
By contrast, this technique is very widely used on Windows, and
the main reason why things run in the "System Tray" in Windows is to try
to gain the upper hand against other programs. Of course, this quickly
leads to a point of diminishing returns, where a few dozen "systray
applets" are installed by various aggressive software vendors, all of
which themselves consume available memory. Eventually, the only result is
that the swap file grows and the system slows to a crawl.
-- Mike