CPU consumption in Unix/Linux operating systems is studied using 8 different metrics: User CPU time, System CPU time, Nice CPU time, Idle CPU time, Waiting CPU time, Hardware Interrupt CPU time, Software Interrupt CPU time, and Stolen CPU time. In this article, let us study ‘waiting CPU time’.

What is ‘waiting’ CPU time?
Waiting CPU time indicates the amount of time the CPU spends waiting for disk I/O or network I/O operations to complete. High waiting time indicates that the CPU is *stranded* because of the I/O operations on that device. For optimal performance, one should aim to keep the I/O waiting CPU time as low as possible. If waiting time is above 10%, it is worth investigating.

You can visualize I/O waiting time through this analogy: say hundreds of cars and bikes are waiting on a busy road for the traffic light to switch from ‘red’ to ‘green’. If, due to some technical glitch, the light takes a long time to switch, those hundreds of cars and bikes are stranded unnecessarily. This results in several undesirable side effects: passengers reach their destinations late, frustrated drivers start honking (noise pollution), and fuel is wasted because the engines keep running (air pollution).

How to find ‘waiting’ CPU time?
Waiting CPU time can be found from the following sources:

a. You can use web-based root cause analysis tools like yCrash to report ‘waiting’ CPU time. The tool is capable of generating alerts if ‘waiting’ CPU time goes beyond a threshold.

b. ‘Waiting’ CPU time is also reported by the Unix/Linux command line tool ‘top’ in the field ‘wa’, as highlighted in the image below.


Fig: ‘wa’ time in top
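
If you want to compute the same figure programmatically, the sketch below may help. It assumes a Linux host, where the first line of /proc/stat holds cumulative CPU counters in the order user, nice, system, idle, iowait, irq, softirq, steal. The class name IoWaitSampler and its methods are hypothetical helpers written for this illustration. They are not part of top, yCrash or BuggyApp.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: estimates the 'wa' percentage from two samples
// of the aggregate "cpu" line in /proc/stat (Linux only).
class IoWaitSampler {

    // Parses the aggregate "cpu" line into its numeric counters.
    // Assumed field order: user nice system idle iowait irq softirq steal ...
    static long[] parseCpuLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 6 || !parts[0].equals("cpu")) {
            throw new IllegalArgumentException("Not an aggregate cpu line: " + line);
        }
        long[] counters = new long[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            counters[i - 1] = Long.parseLong(parts[i]);
        }
        return counters;
    }

    // Returns iowait as a percentage of all CPU time elapsed between two samples.
    static double ioWaitPercent(long[] before, long[] after) {
        // Sum only the first 8 counters (user..steal); guest time is
        // already accounted for inside user and nice.
        int fields = Math.min(8, Math.min(before.length, after.length));
        long totalDelta = 0;
        for (int i = 0; i < fields; i++) {
            totalDelta += after[i] - before[i];
        }
        if (totalDelta <= 0) {
            return 0.0;
        }
        long ioWaitDelta = after[4] - before[4]; // index 4 is iowait
        return 100.0 * ioWaitDelta / totalDelta;
    }

    // Reads the current counters from /proc/stat.
    static long[] readCounters() throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/stat"));
        return parseCpuLine(lines.get(0));
    }

    public static void main(String[] args) throws Exception {
        long[] before = readCounters();
        Thread.sleep(1000); // sampling interval of one second
        long[] after = readCounters();
        System.out.printf("wa: %.1f%%%n", ioWaitPercent(before, after));
    }
}
```

The counters are cumulative since boot, so a single reading is not meaningful on its own. The percentage comes from the difference between two samples taken some interval apart.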

How to simulate high ‘waiting’ CPU time?
To simulate high ‘waiting’ CPU time, let’s use BuggyApp. BuggyApp is an open-source Java project that can simulate various sorts of performance problems. When you launch BuggyApp with the following arguments, it will cause the ‘waiting’ CPU consumption to spike on the host.

java -jar buggyApp.jar PROBLEM_IO



Fig: You can see the waiting CPU time spike up to 75.9%

Now let’s look at the source code in BuggyApp that causes the ‘waiting’ CPU time to spike.

public class IODemo {
 
	public void start() {
 
		for (int counter =1; counter <= 5; ++counter) {
 
			// Launch 5 threads.
			new IOThread ("fileIO-" + counter + ".txt").start();
		}
	}
}
 
 
public class IOThread extends Thread {
 
	public String fileName;
 
	public static final String CONTENT = 
"Hello World! We are building a simple chaos engineering product here. \n" +
"Hello World! We are building a simple chaos engineering product here. \n" +
"Hello World! We are building a simple chaos engineering product here. \n" + 
"Hello World! We are building a simple chaos engineering product here. \n" +
"Hello World! We are building a simple chaos engineering product here. \n" + 
"Hello World! We are building a simple chaos engineering product here. \n" + 
"Hello World! We are building a simple chaos engineering product here. \n" + 
"Hello World! We are building a simple chaos engineering product here. \n";
 
	public IOThread(String fileName) {
		this.fileName = fileName;
	}
 
	public void run() {
 
		// Loop infinitely, writing the file and then reading it back.
		while (true) {
 
			// Write the contents to the file.
			FileUtil.write(fileName, CONTENT);
 
			// Read the contents from the file.
			FileUtil.read(fileName);
		}
	}
}

Here you can see that BuggyApp launches 5 ‘IOThread’ instances in the ‘IODemo’ class. Each ‘IOThread’ runs an infinite while loop, in which it writes the content into a file and then reads the same content back from the file. It repeats these 2 steps again and again. Since writing contents to and reading contents from the disk is an I/O-intensive operation, ‘waiting’ CPU time spikes up to 75.9%.
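
The listing above calls ‘FileUtil.write’ and ‘FileUtil.read’, but their source is not shown here. The sketch below is only a guess at what such helpers might look like, written with the standard java.nio.file API. The class name FileUtilSketch is hypothetical, and BuggyApp’s actual implementation may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in for the FileUtil class used by IOThread.
// This is an assumption for illustration, not BuggyApp's real code.
class FileUtilSketch {

    // Creates the file if needed, truncates it, and writes the whole content.
    static void write(String fileName, String content) {
        try {
            Files.write(Paths.get(fileName), content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file back into a String.
    static String read(String fileName) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(fileName));
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With helpers like these, every loop iteration opens the file, transfers the full content, and closes the file again, twice. Five threads doing this without any pause is what keeps the disk continuously busy.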

How to resolve high ‘waiting’ time?
If your device is suffering from high I/O waiting time, you can follow the steps outlined below to reduce it:

1. You can use root cause analysis tools like yCrash, which will point to the lines of code in the application that are causing the high I/O waiting time.
2. You can optimize the application’s waiting time by doing the following:

* Reduce the number of database calls
* Optimize the database queries such that less data is returned from DB to app
* Reduce the number of network calls that are made to external applications
* Try to minimize the amount of payload that is sent between external applications and your application
* Try to reduce the number of files written to disk
* Try to reduce the amount of data read from disk
* Make sure only the essential log statements are written to disk

3. Make sure your OS is running the latest version with all patches installed. This is not only good from a security perspective, it can also improve performance.
4. Make sure sufficient free memory is available on the device. Lack of free memory has two detrimental effects:

* If there is a lack of free memory, processes will be swapped in and out of memory. Pages will be written to and read from disk frequently, which increases disk I/O operations.
* If there is little free memory, the OS won’t be able to cache frequently used disk blocks in memory. When frequently used disk blocks are cached, I/O waiting time goes down.

5. Keep your filesystem disk usage below 80% to avoid excessive fragmentation. When there is excessive disk fragmentation, I/O time will increase.
6. If all of the above steps fail, you may also consider upgrading your storage for better performance, for example by switching to faster SSD, NVMe, or SAN storage.
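
As an illustration of the advice in step 2 about reducing disk writes, here is a minimal sketch that sends many small writes through a single buffered stream instead of opening the file once per write. The class name BatchedWriter and its method are hypothetical names chosen for this example. Whether batching helps in your application depends on its I/O pattern, so treat it as a starting point rather than a guaranteed fix.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: batches many small writes into one buffered stream.
class BatchedWriter {

    // Opens the file once, writes every line through an in-memory buffer,
    // and flushes to disk when the writer is closed.
    static void writeAll(String fileName, List<String> lines) {
        try (BufferedWriter writer =
                 Files.newBufferedWriter(Paths.get(fileName), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine(); // platform line separator
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Compared with calling a write helper once per line, this approach hands data to the OS in larger chunks and opens and closes the file only once.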