Saturday, September 10, 2011

Concepts of the Linux Page Cache and pdflush

■ Requirement :  Explanation of the page cache & pdflush
■ OS Environment : Linux, RHEL, CentOS
■ Resolution : 

Concepts of the Linux Page Cache and pdflush :

          When we write data, Linux first caches it in an area of memory called the page cache. We can check this cache using the free, vmstat, or top commands, and we can also get details from /proc/meminfo.

        As pages are written, the size of the "Dirty" section will increase. Once writes to disk have begun, you'll see the "Writeback" figure go up until the write is finished. It can be very hard to actually catch the Writeback value going high, as its value is very transient and only increases during the brief period when I/O is queued but not yet written.
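For example, both counters can be read straight out of /proc/meminfo while a write is in progress (the values are reported in kB):

$ grep -E '^(Dirty|Writeback):' /proc/meminfo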

pdflush (kernel threads) :

           Linux usually writes data out of the page cache using kernel threads called pdflush. At any moment, between two and eight pdflush threads are running on the system. You can monitor how many are active by looking at /proc/sys/vm/nr_pdflush_threads. Whenever all existing pdflush threads have been busy for at least one second, an additional pdflush thread is spawned. The new threads try to write back data to device queues that are not congested, the aim being that each active device gets its own thread flushing data to it. Each time a second passes without any pdflush activity, one of the threads is removed. There are tunables for adjusting the minimum and maximum number of pdflush threads, but it is very rare that they need to be adjusted.
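The current thread count can be checked at any time; on an idle system you will normally just see the minimum of two:

$ cat /proc/sys/vm/nr_pdflush_threads
2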

Tune pdflush :

Exactly what each pdflush thread does is controlled by a series of parameters in /proc/sys/vm:

1. /proc/sys/vm/dirty_writeback_centisecs (default 500): In hundredths of a second, this is how often pdflush wakes up to write data to disk. The default wakes up the two (or more) active threads every five seconds.

2. /proc/sys/vm/dirty_expire_centisecs (default 3000): In hundredths of a second, how long data can be in the page cache before it's considered expired and must be written at the next opportunity. Note that this default is very long: a full 30 seconds. That means that under normal circumstances, unless you write enough to trigger the other pdflush method, Linux won't actually commit anything you write until 30 seconds later.

3. /proc/sys/vm/dirty_background_ratio (default 10): Maximum percentage of active memory that can be filled with dirty pages before pdflush begins to write them.

Note that some kernel versions may internally put a lower bound on this value at 5%. As an example, on a system with about 4GB of RAM (like the one in the free output below) where the memory this percentage is applied against comes to roughly 2.5GB, the default of 10% means the system actually begins writing when the total of Dirty pages is slightly less than 250MB, not the 400MB you would expect from the total memory figure.

4. /proc/sys/vm/dirty_ratio (default 40): Maximum percentage of total memory that can be filled with dirty pages before processes are forced to write dirty buffers themselves during their time slice instead of being allowed to do more writes.

Note that all processes are blocked for writes when this happens, not just the one that filled the write buffers. This can cause what is perceived as an unfair behavior where one "write-hog" process can block all I/O on the system. The classic way to trigger this behavior is to execute a script that does "dd if=/dev/zero of=hog" and watch what happens.

To see this in action, run the following in one terminal:

# dd if=/dev/zero of=hog

and in a second terminal watch the Dirty and Writeback figures:

# watch cat /proc/meminfo

The current settings of all four tunables can be checked and changed as shown below.
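The sysctl names simply mirror the /proc/sys/vm paths. A minimal sketch (the values printed are the defaults quoted above; your kernel's defaults may differ):

# sysctl vm.dirty_writeback_centisecs vm.dirty_expire_centisecs vm.dirty_background_ratio vm.dirty_ratio
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
vm.dirty_background_ratio = 10
vm.dirty_ratio = 40
# sysctl -w vm.dirty_background_ratio=5

Writing directly to /proc/sys/vm/dirty_background_ratio does the same thing; add the setting to /etc/sysctl.conf to make it permanent.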

When does pdflush write?

       Data written by applications sits in the page cache until either a) it is more than 30 seconds old, or b) dirty pages have consumed more than 10% of the active, working memory.

Tuning Recommendations for write-heavy operations :

Important : The usual issue that people who write heavily encounter is that Linux buffers too much information at once in its attempt to improve efficiency. This is particularly troublesome for operations that need to synchronize the file system using system calls like fsync. If there is a lot of data in the cache when such a call is made, the system can appear to freeze for quite some time while the sync is processed.
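A rough way to see the effect on a test box (assuming a few GB of free disk space; "hog" is just a scratch file you can remove afterwards) is to build up a large Dirty backlog and then time a full sync:

# dd if=/dev/zero of=hog bs=1M count=2048
# time sync

The larger the Dirty figure in /proc/meminfo at the moment sync is issued, the longer the second command will take to return.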

dirty_background_ratio: The primary tunable to adjust, probably downward. If your goal is to reduce the amount of data Linux keeps cached in memory, so that it writes it out more steadily rather than in large batches, lowering dirty_background_ratio is the most effective way to do that. The default is most likely too large in situations where the system has a large amount of memory and/or slow physical I/O.

dirty_ratio: A secondary tunable, worth adjusting only for some workloads. Applications that can cope with their writes being blocked altogether might benefit from substantially lowering this value. Note that the write-blocking behavior described above is easier to run into once dirty_ratio has been reduced below its default.

dirty_expire_centisecs: Test lowering, but not to extremely low levels. This setting controls how long pages can sit dirty in memory, so lowering it gets data onto disk sooner, but it will also considerably slow average I/O speed because writing becomes much less efficient. This is particularly true on systems with slow physical I/O to disk. Because of the way the dirty page writing mechanism works, trying to lower this value to be very quick (less than a few seconds) is unlikely to work well; constantly trying to write dirty pages out will just trigger the I/O congestion code more frequently.

dirty_writeback_centisecs: Leave alone. The timing of pdflush threads set by this parameter is so complicated by rules in the kernel code for things like write congestion that adjusting this tunable is unlikely to cause any real effect. It's generally advisable to keep it at the default so that this internal timing tuning matches the frequency at which pdflush runs.

Statistical data :


$ free
             total       used       free     shared    buffers     cached
Mem:       4040360    4012200      28160          0     176628    3571348
-/+ buffers/cache:     264224    3776136
Swap:      4200956      12184    4188772
$

In this example the total amount of available memory is 4040360 KB. 264224 KB are used by processes and 3776136 KB are free for other applications. Don't be confused by the first line, which shows that only 28160 KB are free. Using available memory for buffers (file system metadata) and cache (pages with the actual contents of files or block devices) helps the system run faster, because disk information is already in memory, which saves I/O.

Swap memory : Additional memory taken from the hard disk and used in addition to RAM. Dirty data may reside here too and can be moved directly to disk for writing.

The swap values can be viewed with:

grep SwapTotal /proc/meminfo
cat /proc/swaps
free


Shared Memory : A part of RAM which is used for sharing by processes. Shared memory allows processes to access common structures and data by placing them in shared memory segments. It's the fastest form of Interprocess Communication (IPC) available since no kernel involvement occurs when data is passed between the processes. In fact, data does not need to be copied between the processes.

Check shared memory settings : ipcs -lm
See all shared memory segments : ipcs -m
Details of a segment : ipcs -m -i <shmid>
Remove a segment : ipcrm shm <shmid>

Check semaphore value : ipcs -ls

Change its values (SEMMSL, SEMMNS, SEMOPM and SEMMNI) : echo 250 32000 100 128 > /proc/sys/kernel/sem
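To make the semaphore change permanent across reboots, the usual RHEL/CentOS approach is to record it in /etc/sysctl.conf:

# echo "kernel.sem = 250 32000 100 128" >> /etc/sysctl.conf
# sysctl -p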

Buffer cache : This is the subset of the page cache that holds block device buffers (largely file system metadata) in memory.

IO Request Queue Parameters:

nr_requests : This file sets the depth of the request queue, i.e. the maximum number of disk I/O requests that can be queued up. The default value depends on the I/O scheduler that is selected.

read_ahead_kb : This file sets the size of read-aheads, in kilobytes. The I/O subsystem enables read-ahead once it detects sequential disk block access. This file sets the amount of data to be “pre-fetched” for an application and cached in memory to improve read response time.
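Both files live under the device's queue directory in sysfs and can be read and changed at runtime. A quick sketch, using sda purely as an example device and with illustrative values:

$ cat /sys/block/sda/queue/nr_requests
$ cat /sys/block/sda/queue/read_ahead_kb
# echo 512 > /sys/block/sda/queue/nr_requests
# echo 1024 > /sys/block/sda/queue/read_ahead_kb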

The tunable variables for the cfq scheduler are set in files found under /sys/block/<device>/queue/iosched/. These files are:

quantum : Total number of requests to be moved from internal queues to the dispatch queue in each cycle.

queued : Maximum number of requests allowed per internal queue.
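You can confirm which scheduler a device is using and see which cfq tunables your kernel actually exposes (sda is again just an example; the exact set of files under iosched/ varies between kernel versions):

$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
$ ls /sys/block/sda/queue/iosched/
# echo 8 > /sys/block/sda/queue/iosched/quantum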

Prioritizing I/O Bandwidth for Specific Processes : When the cfq scheduler is used, you can adjust the I/O throughput for a specific process using ionice. ionice allows you to assign any of the following scheduling classes to a program:

• idle (lowest priority)
• best effort (default priority)
• real-time (highest priority)

For more information about ionice, scheduling classes, and scheduling priorities, refer to man ionice.
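For example (the tar command and the PID are purely illustrative), -c3 runs a job in the idle class so it only uses otherwise idle I/O bandwidth, while -c2 -n0 gives an already-running process the highest best-effort priority:

# ionice -c3 tar czf /tmp/backup.tar.gz /home
# ionice -c2 -n0 -p 1234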

Deadline scheduler : The deadline scheduler aims to keep latency low, which is ideal for real-time workloads. On servers that receive numerous small requests, the deadline scheduler can help by reducing resource management overhead. This is achieved by ensuring that an application has a relatively low number of outstanding requests at any one time. The tunable variables for the deadline scheduler are set in files found under /sys/block/<device>/queue/iosched/. These files are:

read_expire : The amount of time (in milliseconds) before each read I/O request expires. Since read requests are generally more important than write requests, this is the primary tunable option for the deadline scheduler.

write_expire : The amount of time (in milliseconds) before each write I/O request expires.

fifo_batch : When a request expires, it is moved to a "dispatch" queue for immediate servicing. These expired requests are moved by batch. fifo_batch specifies how many requests are included in each batch.

writes_starved : Determines the priority of reads over writes. writes_starved specifies how many read requests should be moved to the dispatch queue before any write requests are moved.

front_merges : In some instances, a request that enters the deadline scheduler may be contiguous to another request in that queue. When this occurs, the new request is normally merged to the back of the queue.

front_merges controls whether such requests should be merged to the front of the queue instead. To enable this, set front_merges to 1. front_merges is disabled by default (i.e. set to 0).
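A short sketch of switching a device to the deadline scheduler and adjusting a couple of its tunables (sda and the numbers are only examples):

# echo deadline > /sys/block/sda/queue/scheduler
# echo 250 > /sys/block/sda/queue/iosched/read_expire
# echo 1 > /sys/block/sda/queue/iosched/front_merges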


Anticipatory Scheduler : The tunable variables for the anticipatory scheduler are set in files found under /sys/block/<device>/queue/iosched/. These files are:

read_expire : The amount of time (in milliseconds) before each read I/O request expires. Once a read or write request expires, it is serviced immediately, regardless of its targeted block device. This tuning option is similar to the read_expire option of the deadline scheduler. Read requests are generally more important than write requests; as such, it is advisable to give read_expire a faster expiration time. In most cases, this is half of write_expire. For example, if write_expire is set to 248, it is advisable to set read_expire to 124.

write_expire : The amount of time (in milliseconds) before each write I/O request expires.

read_batch_expire : The amount of time (in milliseconds) that the I/O subsystem should spend servicing a batch of read requests before servicing pending write batches (if there are any). read_batch_expire is typically set as a multiple of read_expire.

write_batch_expire : The amount of time (in milliseconds) that the I/O subsystem should spend servicing a batch of write requests before servicing pending read batches.

antic_expire : The amount of time (in milliseconds) to wait for an application to issue another I/O request before moving on to a new request.
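On kernels that still ship the anticipatory scheduler, selecting and tuning it follows the same pattern (sda is an example device; 248 and 124 echo the write_expire/read_expire pairing suggested above, and the antic_expire value of 10 is only illustrative):

# echo anticipatory > /sys/block/sda/queue/scheduler
# echo 248 > /sys/block/sda/queue/iosched/write_expire
# echo 124 > /sys/block/sda/queue/iosched/read_expire
# echo 10 > /sys/block/sda/queue/iosched/antic_expire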
