Virtual Instance Discovery and Analysis

As an Infrastructure Engineer or Architect you need a good grasp of the systems that comprise your environment. In the past this was fairly straightforward: you kept configuration items in your CMDB, and teams had their workbooks and playbooks. However, in this new world of DevOps and CI/CD, with automation tool sets such as Terraform, Chef, Ansible, BladeLogic, and many others, developers can stand up their own virtual instances and tear them down again, which makes it hard to keep a complete picture of your environment. This is especially difficult for storage infrastructure, since many virtual instances can be deployed on large data stores, and when there is an IO performance problem, tracking down the related hardware can be like following the rabbit to Wonderland.

I ran into this while leading several projects for storage infrastructure servicing an ESX environment, and I developed a PowerCLI script to pull the necessary virtual instance and datastore data from the vSphere systems. First you will need to install VMware PowerCLI, which can be found here:
https://developer.vmware.com/web/tool/12.4/vmware-powercli

Next is the PowerCLI script itself. There are many examples of PowerCLI scripts that can be found with a quick internet search. To use PowerShell with VMware's PowerCLI, the first step is to set the PowerCLI configuration. If you aren't using signed certificates you will want to ignore invalid certificates, and you will want to step through each VI (vCenter) server in single mode so you don't get interleaved output. You will also need to store credentials to use for each login using the Get-Credential cmdlet, which may be exported as XML to a file for subsequent use.

Now comes our code, which includes a list of VI (vCenter) servers to be queried. Loop through these and use Get-VM, Get-HardDisk, Get-Datastore and Get-Datacenter to pull the data from each vCenter server, then output a line for each virtual machine hard disk in a comma-delimited format. This can then be imported into Excel and cross-referenced with CMDB data and storage reports. Using Excel for infrastructure analysis is another topic for another day.

#################################################################################
# Get List of VMs, HDD Size, Datastores Size, Folder name, Operating System
#
# Version: 2.0  Author: Hicham Hassan IBM  - Feb 2021
# Version: 3.0  Author: Devin Adint IBM - Feb 2021 - added credentials, vcenter
#               server list and loop, and requires powercli module 
#               added to fields device-diskname, IP, DataStoreGB, etc.
#################################################################################

#Check that Modules are available and loaded - exit if not found
#Requires -Modules VMware.VimAutomation.Core
Set-PowerCLIConfiguration -InvalidCertificateAction Ignore -DefaultVIServerMode single -Confirm:$false


# Load my credentials from file, or if the file does not exist, request credentials and save them to file
if (-not (Test-Path ".\userid.cred")) {
    $MyCredentials = Get-Credential
    $MyCredentials | Export-CliXml -Path ".\userid.cred"
} else {
    $MyCredentials = Import-CliXml ".\userid.cred"
}

# Load list of vCenter Servers from File
if (Test-Path ".\viservers.csv") {
    $VMSearch   = Import-Csv ".\viservers.csv" -Header ("VIServer","DataCenter")
    $VIServer   = $VMSearch | ForEach-Object { $_.VIServer }
    $DataCenter = $VMSearch | ForEach-Object { $_.DataCenter }
} else {
    $VIServer = @( "esx001.domain.com",
                   "esx002.domain.com",
                   "esx003.domain.com",
                   "esx004.domain.com" )
}

# 
Foreach ($vCenter in $VIServer){

    #Connects to the vCenter
    Connect-VIServer -Server $vCenter -Credential $MyCredentials

    Get-VM -PipelineVariable vm |
    ForEach-Object -Process {
        Get-HardDisk -VM $vm |
        Select-Object @{N='VCName';E={$vCenter}},
            @{N='DCname';E={(Get-Datacenter -Cluster $vm.VMHost.Parent).Name}},
            @{N='Cluster';E={$vm.VMHost.Parent}},
            @{N='VMHost';E={$vm.VMHost.Name}},
            @{N='VMName';E={$vm.Name}},
            @{N='PowerState';E={$vm.PowerState}},
            @{N='OS';E={$vm.Guest.OSFullName}},
            @{N='OSConfigured';E={$vm.ExtensionData.Config.GuestFullName}},
            @{N='IP';E={$vm.Guest.IPAddress}},
            @{N='HDDName';E={$_.Name}},
            @{N='Datastore';E={
                $script:ds = Get-Datastore -RelatedObject $vm
                $script:ds.Name}},
            @{N='DataStoreGB';E={[math]::Round($script:ds.CapacityGB)}},
            @{N='DataStoreFreeGB';E={[math]::Round($script:ds.FreeSpaceGB)}},
            @{N='HDDAllocatedGB';E={[math]::Round($_.CapacityGB)}},
            @{N='Device';E={$script:ds.ExtensionData.Info.Vmfs.Extent[0].DiskName}},
            @{N='Type';E={$script:ds.ExtensionData.Info.Vmfs.Type.ToLower() + $script:ds.ExtensionData.Info.Vmfs.MajorVersion}},
            @{N='Status';E={$script:ds.ExtensionData.OverallStatus}},
            StorageFormat
    } | Export-Csv ".\${vCenter}_report.csv" -NoTypeInformation -UseCulture

    Disconnect-VIServer -Confirm:$false

}   # foreach vCenter

    #Export-Csv .\report.csv -NoTypeInformation -UseCulture
	#Export-Excel -Path C:\Temp\Report.xlsx -WorkSheetname Usage -TableName Usage
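For reference, the viservers.csv input has no header row (the -Header parameter supplies the column names), and the script writes one report file per vCenter. A minimal sketch of the input file and a run, with host names, datacenter names and the script file name as placeholders, might look like this:

# viservers.csv  (one vCenter per line: VIServer,DataCenter)
#   vcenter01.domain.com,DC-East
#   vcenter02.domain.com,DC-West

# Run from the directory holding userid.cred and viservers.csv
PS C:\Scripts> .\Get-VMStorageReport.ps1

# Output: .\vcenter01.domain.com_report.csv, .\vcenter02.domain.com_report.csv, ...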

Optimizing Disk IO Through Abstraction

To Engineer or Not To…

When disk capacity is released to a new application or service, many projects do not consider how best to use the storage that has been provided. Essentially the approaches fall into one of two schools of thought. The first is to reduce upfront engineering to a couple of design options and resolve issues as they arise. The second is to engineer several solution sets with variable parameters, providing a broader palette of solutions and policies from which an appropriate solution may be selected.

Simplified Engineering

  • Apply one of a couple of standard infrastructure designs to a project.
  • This approach involves less work upfront, simpler execution, and less requirements gathering.
  • Potentially more time and effort will be spent resolving issues when the resources and design prove insufficient.

Solution Engineering

  • Develop several standard solution sets and models, and document the policies and procedures used to fit applications and components into these models.
  • This approach involves substantial requirements gathering and entails more work and complexity.
  • The result is more efficient use of resources and less time spent resolving issues, because the design and resources should better fit the system's needs.

Engineering Disk Performance

When it comes to disk performance, the cost of not considering how to optimize the use of the storage infrastructure can be very significant. Consider the IO operation cycle time table below. The order-of-magnitude differences between CPU operations, memory operations and device IO are striking: if cycle time is scaled up so that one CPU cycle takes one second, CPU and memory operations complete in seconds to minutes while device IO takes months to years. Poorly orchestrated device use can therefore have a significant impact. A memory miss costs only a few hundred nanoseconds, but a missed disk IO cycle can cost a couple dozen milliseconds; at that scale, that is the difference between roughly four minutes and eight months.

 

Device         Cycle Time        Cycle in Seconds   Scaled Cycle
CPU            1 nanosecond      1.00×10^-9         1 second
CPU Register   5 nanoseconds     5.00×10^-9         5 seconds
Memory         100 nanoseconds   1.00×10^-7         2 minutes
Disk           10 milliseconds   1.00×10^-2         4 months
NFS Op         50 milliseconds   5.00×10^-2         2 years
Tape           10 seconds        1.00×10^1          3 centuries

 

IO Op Cycle Time Scales
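
To make the scaled column concrete, the scaling is a simple ratio: divide each cycle time by the 1 ns CPU cycle and re-read the result with seconds as the unit. A quick PowerShell sketch of that arithmetic, using the times from the table:

# Scale device cycle times so that one CPU cycle (1 nanosecond) maps to one second.
$cpuCycle = 1e-9
$devices  = [ordered]@{ 'CPU' = 1e-9; 'Memory' = 1e-7; 'Disk' = 1e-2; 'NFS Op' = 5e-2; 'Tape' = 1e1 }
foreach ($name in $devices.Keys) {
    $scaledSeconds = $devices[$name] / $cpuCycle      # e.g. disk: 1e-2 / 1e-9 = 1e7 "seconds"
    $scaledDays    = $scaledSeconds / 86400           # 1e7 seconds is roughly 116 days (~4 months)
    '{0,-8} {1,15:N0} scaled seconds  (~{2:N1} days)' -f $name, $scaledSeconds, $scaledDays
}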

How IO devices are used by systems can impact device performance in several ways. First, between the system and the device there is usually a cache, which attempts to predict what data will be needed and pre-stages it in faster memory in anticipation of potential access requests. Traditional cache algorithms identify data that is accessed in ordered proximity on the IO device and read ahead to pre-fetch subsequent data. These pages are then left in cache until they are aged out by subsequent access to other data on the device. The result is that sequential IOs experience better performance because the data will already be in cache, while random or sporadic IOs require direct access to the slower storage device. Newer caching algorithms such as SARC are designed to better anticipate random IO, but they are no substitute for considering data layout on the devices. An example of poor data layout would be overly aggressive server-side striping with volume management, which can cause IOs that are sequential on the system to appear random on the storage device, in essence circumventing the cache.

The second area of consideration is the storage device's back-end performance. Back-end performance can be impacted by IO sizes and by how IOs are distributed across array boundaries. The most significant impact comes from conflicting IOs that request data from different portions of the storage device media at the same time. The result is what is termed "thrashing": the device access mechanism must be readdressed from one part of the media to another, requiring ever more time for continuous repositioning.

Optimizing the Layout of Information on Storage

Component Activity Diagram

With all these potential factors that can impact storage performance, how can a method for optimizing the use of storage devices be formulated? Some of the decisions have to be made on the back end by the storage analysts who configure and present the storage to hosts. Back-end configuration considerations vary by storage device implementation, but most storage subsystems have facilities that automatically redistribute data to balance IOs across the device media, and directions on optimum logical device configuration are available in architecture documentation and from field engineers. Once these configured devices are made available to a system, a method of laying out data needs to be considered that increases sequential IO and reduces the potential for IO conflicts. Developing a standard method, rather than creating unique custom solutions for each application component, is more advantageous because it establishes a more predictable, re-creatable and supportable environment. This method should consider three factors: what the component does, what the component stores, and how large an implementation the component is supporting.

Categorizing Component Functionality

Application Functional Categories

The first step is to determine what an application component does. Some applications are a single entity, while others are non-monolithic and break the work they perform into subsets of functionality handled by components. All application or component functionality can be placed into one or more of four categories:

  • Client servers include web servers, web portals, portal servers, SNA gateways, and Citrix servers. This category is mostly comprised of executables and static configurations or models, with some temporary file requirements.
  • Utility servers provide peripheral or supporting functionality such as reporting and authentication, and include LDAP servers, Active Directory servers, messaging servers, certificate servers, file transfer servers, EDI and OLAP servers. This category has a broad constituency: executables along with configuration files, temporary file space, and data space for certificates and directory structures. The IO profiles range from sporadic to rather intense in the case of OLAP servers.
  • Application servers provide a framework for software development through distributed objects and APIs that supply business logic and business process functionality; examples include IBM WebSphere, GlassFish, DataStage TX, BEA Tuxedo, and AX Workflow Manager. This category generally has executables, configuration files and some temporary space, but interfaces with data servers for any major data functions.
  • Data servers manage access to structured, semi-structured and unstructured data. They include databases such as Oracle and DB2 and document management applications such as Documentum. This category generally includes executables and configuration files plus large data stores managed by a data engine, and tends to perform the brunt of data manipulation.

Categorizing components using these definitions helps to narrow down storage infrastructure performance requirements.

Categorizing Storage IO Activity

The next step is to determine what types of data are stored and how they are accessed by a component. The types of data that an application component stores and accesses fall into one of four categories based upon how the data is used and accessed; see the diagram of the categories below.

Types of Storage Access

The categories break down as follows:

  • Static data is comprised of executables, configuration files, models, static images, templates, style sheets, etc. This data usually gets loaded once when a process starts, and much of what is needed is instantiated in the system's memory as a running process or as cached, frequently used templates and image files; IO therefore happens at start-up and less frequently thereafter.
  • Transient data is comprised of temporary files, database transaction logs, database redo logs, spools, queues, temporary database tables, etc. This data is usually accessed only a few times and then either purged or over-written in a cyclical file. The IO profile is more random and sporadic, and in the case of data servers it usually works in concert with access to retained active data.
  • Active data includes RDBMS data, HDMS data, file shares, document or image stores, etc. This is the retained working data set used by an application component and typically involves heavier sequential IO operations.
  • Reference data includes RDBMS indexes, HDMS directory data and metadata. It is generally referenced in concert with access to retained active data, to identify which active data needs to be accessed based upon criteria stored in the reference data.

Categorizing the file systems and directories specified by a component into one of these categories helps to identify a data structure that isolates IO to reduce contention and consolidates similar IO patterns, thereby increasing sequential IO potential.

Determining a Size Model

Model    Description
Small    50 – 256 GB or up to 1200 IO/s
Medium   256 – 750 GB or up to 1800 IO/s
Large    750 – 5000 GB or up to 3600 IO/s
Huge     >5000 GB or >3600 IO/s

Now that we know what the application component does and what types of data files it uses, the next step is to determine how large an implementation this component is supporting. The size determination should be based upon the size of the file spaces and/or the estimated number of IOs. The resulting model is then used to determine whether, and to what extent, the file spaces need to be separated based upon the storage activity category.
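
The thresholds in the table are easy to wrap in a small helper so the size model falls out of the capacity and IO estimates gathered during requirements. A minimal sketch follows; the function name and parameters are my own invention rather than part of any toolset:

# Return a size model (Small/Medium/Large/Huge) from estimated capacity and IO rate.
function Get-SizeModel {
    param(
        [double]$CapacityGB,    # total requested file space in GB
        [double]$IOPerSec = 0   # estimated IO/s, if known
    )
    if ($CapacityGB -gt 5000 -or $IOPerSec -gt 3600) { return 'Huge' }
    if ($CapacityGB -gt 750  -or $IOPerSec -gt 1800) { return 'Large' }
    if ($CapacityGB -gt 256  -or $IOPerSec -gt 1200) { return 'Medium' }
    return 'Small'
}

# Example: 900 GB of database tables at an estimated 2000 IO/s
Get-SizeModel -CapacityGB 900 -IOPerSec 2000   # -> Large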

Using OS Storage Abstractions to Optimize Layout and Use

Finally, now that much has been learned about the component being implemented and the file space and implementation size have been determined, what mechanism may be used to manage IOs? Most operating systems have facilities for creating logical abstractions of storage. The OS Storage Abstraction diagram below depicts a typical volume management facility. This is true of HP-UX LVM, the AIX Logical Volume Manager and Veritas Volume Manager, and is even true of Solaris ZFS zpools and file systems.

OS Storage Abstraction Functionality

These abstraction structures provide the means to limit storage access in three ways. Device groups (volume groups, or labeled disks on Windows) provide a means of isolating IO: the logical partitions are distributed across the presented disks based upon several parameters, but all of their IO is limited to the devices within the group. The logical partitions and file systems limit data growth, and directories provide a logical segregation of files into hierarchies for organizational purposes. Once an implementation size is determined, a model may be assigned. In a small model all the storage activity categories may be handled by a single device group. In a medium model the reference and transient data should be placed in one device group and the static and active data in another. In a large model static and reference data are placed in one device group, transient data in a second and active data in a third, with file systems created per specification. Finally, if a component is determined to have a huge IO or size requirement, four or even more device groups should be considered, including multiple partitions of the transient, active and reference data.
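
That grouping can be recorded as a simple lookup from size model to device groups, keyed by the storage activity categories each group carries (S = static, T = transient, A = active, R = reference). A sketch of such a mapping, following the splits just described; the variable name and labels are illustrative:

# Device group split per size model; each entry lists the categories placed together.
$DeviceGroupLayout = @{
    Small  = @('S+T+A+R')               # one device group for everything
    Medium = @('T+R', 'S+A')            # transient+reference split from static+active
    Large  = @('S+R', 'T', 'A')         # static+reference, transient, active
    Huge   = @('S', 'R', 'T', 'A')      # four or more groups; T/A/R may be partitioned further
}
$DeviceGroupLayout['Medium']            # -> T+R and S+A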

Division of the volume groups may be tackled in several ways. One way would be to track all device group names and their purposes, adding new device groups based upon the new application components and the service or application they support. This becomes more complex in shared-service environments where one system may host several instances of a component. An easier way to accomplish this is a device group naming convention. With a naming scheme, analysts don't need to know what device groups already exist, how storage is used on a system can be seen at a glance from the device group names, and unnecessary volume groups are not created, since any new file systems that fall into pre-existing categories end up assigned to the appropriate pre-existing device group. Essentially the device group names are self-tracking, because applying the naming process yields either the same name or a new name when one is actually needed. The implementation of the naming scheme can be automated in a spreadsheet form using lookups based upon the models and categorizations of the file systems requested in the form; a small sketch of the same lookup logic appears after the code table below. The naming scheme I have used has the following form:

  • [app abbreviation, up to 4 characters][instance number][storage activity code (STAR)][environment code]
  • Example: Oracle PeopleSoft data tables, production = orps1ap
    Oracle PeopleSoft index tables, production = orps1rp
Component     Code   Storage Access   Code   Environment          Code
Client        C      Static           S      Development          D
Utility       U      Transient        T      Integration Test     I
Application   A      Active           A      Acceptance Test      A
Data          D      Reference        R      Quality Assurance    Q
                                             Production           P
                                             Disaster Recovery    R
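
Here is the small sketch of the naming lookup mentioned above: it composes a device group name from the application abbreviation, instance number, storage activity code and environment code, using the codes from the table. The function and parameter names are my own:

# Compose a device group name from the naming scheme:
# [app abbreviation, up to 4 chars][instance number][storage activity code][environment code]
function New-DeviceGroupName {
    param(
        [string]$AppAbbreviation,                               # e.g. 'orps' for Oracle PeopleSoft
        [int]$Instance = 1,
        [ValidateSet('S','T','A','R')][string]$ActivityCode,    # Static/Transient/Active/Reference
        [ValidateSet('D','I','A','Q','P','R')][string]$EnvironmentCode
    )
    # Trim the abbreviation to four characters and build the lower-case name.
    $abbrev = $AppAbbreviation.Substring(0, [math]::Min(4, $AppAbbreviation.Length))
    return ('{0}{1}{2}{3}' -f $abbrev, $Instance, $ActivityCode, $EnvironmentCode).ToLower()
}

New-DeviceGroupName -AppAbbreviation 'orps' -Instance 1 -ActivityCode A -EnvironmentCode P   # -> orps1ap
New-DeviceGroupName -AppAbbreviation 'orps' -Instance 1 -ActivityCode R -EnvironmentCode P   # -> orps1rp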

Conclusion

Exploring the structures that comprise an application component and then making informed judgments based upon the size of the implementation will result in better performance and more efficient use of resources. Isolating IOs by the type of data access places like data together, increasing sequential access potential and decreasing the potential for IO conflicts caused by concurrent access among active, transient and reference data types.

Below is a link to a worksheet I created a while ago to determine volume group names based upon data types.

VG SAN Layout Worksheet: VG Storage Layout Excel Worksheet

Pushing Your Profile and SSH Keys

Whenever you start supporting a new environment, especially in a large corporation, you are usually confronted with many systems. Security will take care of setting up your access across whatever platforms there may be, but generally you are left holding the bag when it comes to setting up your SSH keys and any profile customizations, not to mention distributing any scripts or tools you have come to rely upon. Of course, before you put any tools on a system there are several things to consider. You definitely want to think about which environments you perform the distribution on first; it is always good to start with development or lab environments and move out from there. You will also need to consider the corporate policies for the environment, which might limit your ability to have your own set of tools and scripts at all; you may be limited to simple .profile changes and SSH keys, and implementing a script to push these keys and profiles out may need to go through various degrees of red tape. Whatever policies and requirements exist in your organization, it is your responsibility to know them and to determine how, or whether, the tools discussed here may be used.
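
Once those questions are answered, the push itself can be fairly small. Below is a minimal PowerShell sketch that loops over a host list and distributes a public key and a .profile using the OpenSSH client tools; the account name, host list file and key path are placeholders, and it assumes ssh and scp are on the path and that you already have some form of login to each host:

# Push an SSH public key and a .profile to a list of hosts using OpenSSH client tools.
$user  = 'myid'                              # placeholder account name
$hosts = Get-Content '.\devhosts.txt'        # one hostname per line; start with dev/lab systems
$pub   = Get-Content "$HOME\.ssh\id_rsa.pub" -Raw

foreach ($h in $hosts) {
    # Append the public key to authorized_keys (creating .ssh if needed).
    $pub | ssh "$user@$h" 'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'
    # Copy the profile customizations.
    scp "$HOME\.profile" "${user}@${h}:~/.profile"
}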
