There have been many studies on best practices for managing storage I/O and its complexity. When it comes to storage capacity, controlling I/O and maintaining performance isn't easy in either small or large SAN environments. So the question arises: how can we optimize our environments to support large amounts of bandwidth while also supporting as many I/Os (transactions) as possible?
From a management standpoint, the key metrics for maintaining I/O performance are IOPS, latency, response time, bandwidth, and throughput. All of these must be weighed across a shared pool of resources within a virtual environment, with an understanding of how each decision affects every other area of the environment. For years we have sought the intelligence to not only manage I/O-intensive environments but also drive performance throughout software-defined data centers. Relatively speaking, it is easier to keep track of capacity than performance. So how do you know when your virtual environment needs additional performance?
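These metrics are not independent of one another. As a back-of-the-envelope sketch (the numbers below are illustrative assumptions, not measurements from any real array), throughput is IOPS multiplied by I/O size, and Little's law ties IOPS and latency to the average number of I/Os in flight:

```python
def throughput_mb_s(iops, io_size_kb):
    """Throughput = IOPS x I/O size (KiB), expressed in MB/s."""
    return iops * io_size_kb / 1024

def outstanding_ios(iops, latency_ms):
    """Little's law: average I/Os in flight = arrival rate x latency."""
    return iops * (latency_ms / 1000)

# A hypothetical workload: 8,000 IOPS of 64 KiB I/Os at 2 ms average latency.
print(throughput_mb_s(8000, 64))   # 500.0 -> MB/s of bandwidth consumed
print(outstanding_ios(8000, 2))    # 16.0  -> I/Os queued on average
```

The point of the sketch is that you cannot tune one metric in isolation: pushing IOPS up at fixed latency deepens the queue, and larger I/O sizes trade IOPS for bandwidth.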
If we look at performance issues on a disk, it is easy to say the disk is stressed. It is then up to IT operators to determine which factors led to that stress. With an understanding of read and write processes, one can determine how to configure the environment to relieve I/O pressure.
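As a rough illustration of how that "stressed disk" determination is made in practice, the sketch below derives IOPS, average wait time, and utilization from two snapshots of kernel-style I/O counters, similar to what tools like iostat compute. The field names loosely follow Linux's /proc/diskstats, and the sample numbers are made up:

```python
def disk_stress(prev, curr, interval_s):
    """Infer disk stress from the delta between two counter snapshots."""
    ios = (curr["reads"] + curr["writes"]) - (prev["reads"] + prev["writes"])
    busy_ms = curr["io_time_ms"] - prev["io_time_ms"]          # time device was busy
    wait_ms = curr["time_in_queue_ms"] - prev["time_in_queue_ms"]
    return {
        "iops": ios / interval_s,
        "await_ms": wait_ms / ios if ios else 0.0,             # avg per-I/O time incl. queueing
        "util_pct": 100.0 * busy_ms / (interval_s * 1000),     # % of interval device was busy
    }

# Two hypothetical snapshots taken one second apart:
prev = {"reads": 1000, "writes": 4000, "io_time_ms": 0, "time_in_queue_ms": 0}
curr = {"reads": 1800, "writes": 6200, "io_time_ms": 950, "time_in_queue_ms": 45000}
print(disk_stress(prev, curr, interval_s=1.0))
# High await (15 ms) together with ~95% utilization points at a stressed disk.
```

High utilization alone is not conclusive, but combined with rising per-I/O wait times it usually means the device, not the application, is the bottleneck.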
Data is usually written first to the page cache, an area of memory. As part of this process, pages become "dirty" when they hold newly written data that has not yet reached disk. Kernel flusher threads then write the dirty pages to the device queue, satisfying I/O requests and committing the data to the physical disk. I/O bottlenecks usually occur when dirty pages are flushed to disk with journaling disabled (a practice some IT professionals use to make writing a one-step process and decrease read/write latency). Because application I/O demands are unpredictable, there is a lot to consider: the bandwidth available when data is flushed, and its limitations; the rate at which an application writes per second; and, if all flush threads are in use, how that affects the page cache and the bandwidth. There are many layers of complexity at the storage level alone, and many moving parts.
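The flush dynamics above can be sketched with a toy model: an application dirties pages faster than the flusher threads can drain them, and once dirty data crosses a dirty_ratio-style threshold, further writes are throttled. All names and numbers here are illustrative assumptions, not real kernel tunables read from a system:

```python
def seconds_until_throttled(write_mb_s, flush_mb_s, cache_mb, dirty_ratio, horizon_s):
    """Model dirty-cache growth when writes outpace flush bandwidth.

    Returns the second at which dirty data crosses the threshold and
    writers start blocking, or None if it never does within the horizon.
    """
    threshold_mb = cache_mb * dirty_ratio
    dirty_mb = 0.0
    for t in range(1, horizon_s + 1):
        dirty_mb += write_mb_s                   # application dirties pages
        dirty_mb -= min(dirty_mb, flush_mb_s)    # flusher threads drain to disk
        if dirty_mb >= threshold_mb:
            return t                             # writes now stall on flushing
    return None

# Hypothetical workload: writing 300 MB/s against 200 MB/s of flush
# bandwidth, with a 1 GiB cache and a 20% dirty threshold.
print(seconds_until_throttled(300, 200, 1024, 0.20, horizon_s=60))
```

The takeaway is the steady-state check the paragraph implies: if the sustained write rate exceeds the flush bandwidth, the page cache only buys you time before the bottleneck surfaces.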
What if we add another layer of complexity at the application layer by factoring in JVM garbage collection?
All of these processes convey the complexity of managing and controlling I/O-intensive workloads. Over time, new requirements have naturally formed within the IT industry to maintain the performance of software-defined data centers. Every IT operator aims to optimize every level of the stack: applications, hosts, and storage. That means heaps, threads, and garbage collection at the application level; memory, CPU, and network utilization at the host level; and I/O usage at the storage level. With so many moving parts and complexities across different areas of the environment, operators must understand how to properly allocate resources in a complex virtual environment, and the implications of every decision made in an attempt to optimize performance. At the end of the day, we all want the best quality of food delivered to us in the least amount of time, without any disruption in service. For IT, that means application users getting the best quality of service and IT operators preventing bottlenecks. But how can this be done?