Introduction
This guide covers the basics of gauging how many resources you should add to a server to better meet the demands being placed on it.
We have a similar guide that provides a different approach to solving this problem here:
What are the hardware requirements for my server?
The approach described in this guide focuses on planning additional resources for an existing server, or planning the resources of a new server that you intend to migrate your existing server to.
Please also keep in mind that this guide is not a replacement for the professional analysis and opinion of a systems administrator with the training, skills, and experience to determine the resource needs of a Linux server running web-related technologies.
Additionally, cPanel documents specific minimum and recommended hardware requirements. Please review those requirements if you plan to install cPanel on a new server. You and your systems administrator should also keep in mind that the documented requirements assume basic usage. If you run, or plan to run, a very busy server, you should allocate more resources than cPanel's system requirements specify.
cPanel Docs - System Requirements
The Basic Idea
In order to determine the resources you need for an upgrade, you need to know the following:
- Which resources, and how much of each, are currently available to you
- Which resources, and how much of each, you are currently using
- The difference between the resources currently available and the resources currently used
- The timeline and projected growth in demand for resources in the future
If you don't need to be precise about the upgrade, you can make rough estimates of this information and take an educated guess at how many resources to allocate.
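For rough estimates, a few standard Linux utilities provide a quick snapshot of what the server has and what it is currently using (these are common commands available on most distributions; output formats vary):
nproc      # number of CPU cores available
free -h    # memory and swap, total and used
df -h      # disk space, total and used, per filesystem
uptime     # current load averages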
If you need to be more precise, you can read on.
The Methodology
It is important to understand the basic process and concepts involved in properly planning the resource allocation of a server upgrade.
The Process Loop
Proper upgrade planning takes the form of multiple iterative steps. Getting an idea of the basic loop can help you move through this process efficiently.
Step 1 - Data Collection
Raw data collection is the genesis of planning a server upgrade.
This can be as precise and formal, or as haphazard and casual, as you prefer, but your end results rely on the quality of the data you're working from.
What To Collect
- CPU usage over time
This should include information about the kinds of CPU usage that occurred, such as user, system, I/O wait, etc.
This will be used to determine the number of cores to order for the upgrade
- Memory usage over time
This should include information about %commit, swap-in and swap-out rates, available memory, OOM occurrences, etc.
- Disk usage over time
This should include information about %util, %iowait, disk sizes, disk space available and used over time, disk types, disk read and write speeds, etc.
- Network utilization over time
This should include bandwidth usage, data in and out rates, error rates, available transfer speed, etc.
- Load levels over time
While load levels do not tell the whole story, they can be useful as a very rough overview of general resource usage over time. Use them to quickly spot large spikes or dips in general resource usage; they provide little utility beyond that.
Collection Tools
Sysstat - The default on cPanel servers
cPanel servers running on CentOS typically have the sysstat set of utilities installed and enabled by default. Sysstat collects a wide range of system data, including all of the data mentioned above. Although the data collected by default is not highly granular, it is sufficient for planning an upgrade in most situations. If you are not sure whether sysstat is already enabled and collecting data, a quick test is to execute the following command:
sar
If you get data from that command, sysstat is already collecting data about your server. More information about analyzing this data is included later in this article.
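If sar instead reports that it cannot open a data file under /var/log/sa, collection is likely not running yet. On CentOS 7, assuming the sysstat package is installed, you can typically start and enable collection with:
systemctl enable sysstat
systemctl start sysstat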
In addition to the manual pages, a very good place to learn about sysstat capabilities is the GitHub repository:
GitHub - sysstat - System performance tools for the Linux operating system
There is a vast number of other options you can install and configure to collect resource usage data if sysstat does not meet your specific needs. Finding your favorite method is outside the scope of this resource, but I will offer the following alternative since it is a cPanel-supported plugin:
How to install the Munin Monitoring Plugin on cPanel
Once you have collected data to work with, you can move on to the next step in the process loop, Data Analysis.
Step 2 - Data Analysis
Raw data is worthless unless you can extract some meaning or conclusion from it.
Some examples of the default tools available on a cPanel server are explained in more detail below, but please keep in mind that this is a topic of study and practice that can take years to master and perform effectively and efficiently. This guide provides only the most basic information on the subject.
After a first analysis, you may find that you need to collect supplemental data to enable you to do a subsequent round of analysis which may then produce actionable conclusions. This is where the iteration of the loop becomes more apparent.
Basic Data Review With Sysstat
The primary tool you'll use for basic data analysis with sysstat is the sar utility.
Viewing Stats On Different Dates
Sysstat stores its data in individual files, one per day, in the following location. Each file is named sa##, where ## represents the day of the month on which the data was collected.
/var/log/sa/sa##
To view the CPU statistics for the 12th, run the following command:
sar -f /var/log/sa/sa12
Limiting Stats to Certain Time Periods Within the Day
You can specify a start and stop time to only view stats for the time period of interest.
To specify a start time of 2:00 AM and display all stats for the rest of the day on the 12th use the following command:
sar -s 02:00:00 -f /var/log/sa/sa12
To limit that time period to only 12 hours and also omit stats after 2 PM, add the -e flag:
sar -s 02:00:00 -e 14:00:00 -f /var/log/sa/sa12
CPU Statistics
The default for sar is to display CPU statistics, so running it without any report-selection flags will produce CPU stats:
# sar -s 21:30:00 -e 22:00:00 -f /var/log/sa/sa23
Linux 3.10.0-1127.18.2.el7.x86_64 (host.example.tld) 11/23/2020 _x86_64_ (1 CPU)
09:30:01 PM CPU %user %nice %system %iowait %steal %idle
09:40:01 PM all 0.67 0.13 0.39 0.01 0.01 98.80
09:50:01 PM all 0.68 0.06 0.42 0.03 0.01 98.80
Average: all 0.67 0.10 0.40 0.02 0.01 98.80
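As a rough worked example using the averages above: with 98.80% idle on a 1-CPU host, this server consumed only about 1.2% of a single core during this window. More generally, (100 - %idle) / 100 × current core count estimates how many cores' worth of work the server is doing, which is the figure you'll extrapolate from when deciding how many cores to order.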
Memory Statistics
The %commit statistic can oftentimes be very useful. It tells the administrator how much memory the kernel thinks it needs to complete the workload at the time the measurement is recorded. The percentage is based on the combined total of RAM and swap. For example, if swap makes up 20% of that combined total, a %commit of 80% means the kernel wants to use ALL of the RAM your server has available. Once %commit suggests that your server wants to use all of the available RAM, you would typically expect to see the swap rates increase, which often leads to a decrease in overall server performance.
# sar -r -s 21:30:00 -e 22:00:00 -f /var/log/sa/sa23
Linux 3.10.0-1127.18.2.el7.x86_64 (host.example.tld) 11/23/2020 _x86_64_ (1 CPU)
09:30:01 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
09:40:01 PM 220692 1661312 88.27 115712 829924 1835828 76.29 722372 676848 92
09:50:01 PM 221200 1660804 88.25 115772 826732 1830640 76.08 724352 673764 20
Average: 220946 1661058 88.26 115742 828328 1833234 76.19 723362 675306 56
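You can sanity-check the %commit math against the sample above. The RAM total is kbmemfree + kbmemused = 220692 + 1661312 = 1882004 kB; assuming this example host also has a 512 MB (524288 kB) swap partition (the swap size is not shown in this report), the 09:40:01 PM sample works out to:
1835828 / (1882004 + 524288) × 100 ≈ 76.29
which matches the %commit column.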
Swap Rates
Swap Rates are very important to pay attention to for planning memory allocation on a new server.
A small to moderate amount of swapping is expected and totally fine. If you find a pattern of swap rate spikes, or sustained high swap rates, you should plan to add additional RAM to the next server in most cases.
# sar -W -s 21:30:00 -e 22:00:00 -f /var/log/sa/sa23
Linux 3.10.0-1127.18.2.el7.x86_64 (host.example.tld) 11/23/2020 _x86_64_ (1 CPU)
09:30:01 PM pswpin/s pswpout/s
09:40:01 PM 0.00 0.00
09:50:01 PM 0.00 0.00
Average: 0.00 0.00
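The -W report only shows pages moving in and out of swap. To see how much swap space is actually occupied over time, you can also pull the swap utilization report with the -S flag:
sar -S -f /var/log/sa/sa23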
Disk Statistics
The -d flag tells sar to show disk stats, and the -p flag makes the disk names easier to read. Piping this output to the column utility is optional but can sometimes help to make the data more readable.
%util is probably the easiest metric to understand. Very high %util is one clue that you should consult with a systems administrator about planning to implement a storage system that will be capable of handling the data transfer speeds and throughput required of your application.
# sar -dp -s 21:30:00 -e 22:00:00 -f /var/log/sa/sa23 | column -t
Linux 3.10.0-1127.18.2.el7.x86_64 (host.example.tld) 11/23/2020 _x86_64_ (1 CPU)
09:30:01 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
09:40:01 PM sda 1.01 0.08 16.43 16.33 0.00 0.30 0.05 0.00
09:40:01 PM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:40:01 PM sdc 0.02 0.00 0.32 16.00 0.00 1.33 1.08 0.00
09:50:01 PM sda 1.28 0.08 21.07 16.59 0.00 0.43 0.06 0.01
09:50:01 PM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:50:01 PM sdc 0.02 0.00 0.32 16.00 0.00 14.25 14.25 0.03
Average: sda 1.14 0.08 18.75 16.47 0.00 0.37 0.05 0.01
Average: sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdc 0.02 0.00 0.32 16.00 0.00 7.79 7.67 0.02
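Note that rd_sec/s and wr_sec/s are measured in 512-byte sectors, so the average of 18.75 sectors per second written to sda above works out to roughly 18.75 × 512 / 1024 ≈ 9.4 kB/s of write throughput. Converting these columns into bytes makes it much easier to compare current activity against the rated throughput of the storage you're considering for the upgrade.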
Network Statistics
The -n DEV flag shows general network statistics. There are many other variations on this that you can read about in the man page.
# sar -n DEV -s 21:30:00 -e 22:00:00 -f /var/log/sa/sa23
Linux 3.10.0-1127.18.2.el7.x86_64 (host.example.tld) 11/23/2020 _x86_64_ (1 CPU)
09:30:01 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
09:40:01 PM eth0 0.62 0.34 0.07 0.08 0.00 0.00 0.00
09:40:01 PM lo 0.54 0.54 0.13 0.13 0.00 0.00 0.00
09:50:01 PM eth0 0.83 0.55 0.09 0.13 0.00 0.00 0.00
09:50:01 PM lo 0.54 0.54 0.13 0.13 0.00 0.00 0.00
Average: eth0 0.72 0.44 0.08 0.10 0.00 0.00 0.00
Average: lo 0.54 0.54 0.13 0.13 0.00 0.00 0.00
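For example, per-interface error and drop counters, which can point to a saturated or failing link, are available through the EDEV keyword:
sar -n EDEV -f /var/log/sa/sa23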
Load Over Time
# sar -q -s 21:30:00 -e 22:00:00 -f /var/log/sa/sa23
Linux 3.10.0-1127.18.2.el7.x86_64 (host.example.tld) 11/23/2020 _x86_64_ (1 CPU)
09:30:01 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
09:40:01 PM 2 202 0.03 0.03 0.05 0
09:50:01 PM 2 204 0.24 0.06 0.06 0
Average: 2 203 0.14 0.04 0.06 0
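As a rough rule of thumb, load averages should be read relative to the number of CPU cores: the average ldavg-15 of 0.06 above on a 1-CPU host represents roughly 6% of one core's capacity, while a sustained 15-minute load at or above the core count would be a cue to dig into the CPU, disk, and memory reports for the cause.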
Producing Actionable Conclusions Based on Data Review
The most basic way to pull an actionable conclusion from this data is to subtract the resources currently used from the resources currently available, which gives you the amount of idle resources remaining.
If the resulting difference is a negative number, that means that your server is already in a deficit.
Once you have the resulting difference, you should also determine the expected lifetime of the upgraded server, or the amount of time you expect it will be before you perform another upgrade.
Then, look at the rate of increase for the resource usage over time. Try to extrapolate that increase across the expected lifetime of your upcoming upgrade to determine how much more resource usage will be needed by the end of the lifecycle.
Add the total expected increase to the resource deficit if there was one, or subtract the existing surplus from the expected increase, to arrive at the final estimated resource allocation for the upgrade. You'll need to do this for each type of resource individually.
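As a worked example with hypothetical numbers: suppose the current server has 4 CPU cores, your analysis shows peak usage equivalent to 3 cores (a surplus of 1 core), the upgraded server is expected to last 3 years, and CPU demand has been growing by roughly 1 core per year. The expected increase is 3 cores; subtracting the 1-core surplus leaves 2 additional cores needed, so the upgraded server should have at least 6 cores, plus whatever safety margin you're comfortable with.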
Sometimes the review will highlight that you still do not have enough information to make a confident decision about which upgrades are required to accomplish your goals. At that point, you should either move on to the Event Simulation stage, to augment the data you already have and study the situations your server will, and may, encounter, or move back to the original basic data collection step to start over with new data or add collection methods for different data.
Step 3 - Event Simulation
A more common term for this would be load or stress testing. Event simulation is something that may be omitted based on your needs. If you are running mission-critical software, or offer some sort of uptime guarantee, event simulation should be a requirement for your organization in the planning stages of an upgrade. Accidents, inconveniences, and disasters happen every day all around the world, so you should build these possibilities into your planning.
This step is actually a form of the data collection outlined in step one; however, standard data collection focuses on real-life use and events that have happened in the past.
Event simulation is a form of data collection where you look to the future and try to generate data about what may occur so that you can be prepared for those things ahead of time.
Without first collecting data about the real-life use and events in step one, you have no guiding information to help with efficiently simulating events to be prepared for in the future.
The term Event Simulation is used here to emphasize a deeper level of thinking about potential occurrences than typical load testing may suggest. You might load test a website with Apache Bench and determine that the website would use X resources to run. That's great, but Event Simulation looks at things in a less artificial way: you consider the ordinary, probable, and extraordinary types of events that do occur and could occur over the expected lifetime of the upgraded server.
For example, the demands placed on a server by HTTP traffic, which Apache Bench can simulate, are not the only demands placed on a server. What happens when backups are generated at the same time that the server is also experiencing ordinary, probable, or extraordinary levels of HTTP traffic? What about a situation where an I/O-heavy migration to a secondary server is required due to a DoS attack? What about the combination of these situations on top of typical SMTP traffic?
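As a starting point for the HTTP portion of such a simulation, a basic Apache Bench run looks like the following. The URL, request count, and concurrency level here are placeholders; tune them to reflect your real traffic patterns:
ab -n 10000 -c 100 https://www.example.tld/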
At a minimum for mission-critical software, your systems administrator should have a written timeline, plan, and hypothesis for how these situations would be handled on the upgraded server. Ideally, these situations would actually be tested and simulated on real hardware.
If occasional downtime is not a major concern for your application or organization, the significant time and expenses required to undertake the task of evaluating Event Simulation may not be worth pursuing.
Once you have completed an iteration of Event Simulation, you should return to either step one or step two, based on the outcome.