Fixing Excel Data – Text to Rows and Fill Down


Excel spreadsheets are used at almost every level of business.  As a result, spreadsheets are created by people with different goals and different levels of familiarity with the tool.  What many don’t realize is that Excel is primarily a data processing and information storage tool.  Excel is much like a database and is most effective when data is stored the way one would store information in a database.

The problem is that data in a database isn’t necessarily that presentable, so many people enter information into Excel and attempt to format it for presentation: they create groups, merge cells, or insert multiple entries into a single cell and then apply special formatting.  The downside comes when someone wants to use Excel’s graphing and report-building features and the combined or grouped data isn’t accessible, either because cell merging has left values missing from cells or because individual entries can’t be addressed on their own once they have been packed into a single cell.  The better method is to treat the data as a database and ensure that each cell in a column holds the data relevant to its row, so that a complete record is available.  Then use the grouping and pivot table features to create reports that are presentable and far more functional, because the data can be grouped and categorized in two dimensions.

But what do you do if you get a spreadsheet in which data has been entered into one cell but not in the subsequent rows, or has been merged through rows?  The first thing to do is to select the column with merged data and unmerge it.  But then what?  You can manually massage the data, adding rows, extracting the entries that were packed into a single cell, and using a fill-down copy where an entry was merged through rows.  But if you have hundreds of rows this can be time consuming.  I have run into this problem frequently enough that I have taken some time out to write a couple of procedures that speed up the process.

Text To Rows

If you have ever imported data into Excel you may be familiar with the Text to Columns function under the Data menu.  If you didn’t specify delimiters, or missed one while importing, you can select the column with multiple fields of data, tell it what to split the data on, and it will automatically split the data across columns.  But what about rows?  The procedure below may be run on any column where multiple entries have been combined into single cells.  The default delimiter is a comma but this may be modified to meet your particular needs.

Sub TxtToRows()

' Procedure TxtToRows
' Author : Devin Adint
' Purpose: This procedure extracts multiple entries in a single cell into
'          newly inserted subsequent rows, preserving the contents of the
'          surrounding cells in that row. This procedure helps to address
'          times when Excel isn't treated as a database and splits out the
'          unique data into its own row as one would expect in a data set.

    Dim ThisRow As Long
    Dim CrntRow As Long
    Dim CrntCol As Long
    Dim vCellData As Variant
    Dim vCellEntryCount As Long
    Dim vCellCount As Long
    Dim vCellEntry As Variant
    Dim ORIGSHEET As String
    Dim ORIGADDR As String

    ORIGSHEET = ActiveSheet.Name
    ORIGADDR = ActiveCell.Address
    CrntRow = ActiveCell.Row
    CrntCol = ActiveCell.Column
    ThisRow = CrntRow

    With ActiveSheet

        Do While .Cells(ThisRow, CrntCol) <> ""
            ' Extract the cell data into an array (default delimiter is a comma)
            vCellData = Split(.Cells(ThisRow, CrntCol), ",")
            vCellEntryCount = UBound(vCellData)
            ' Loop through each delimited value from the cell stored in the array,
            ' copy the current line inserting a copy below, then over-write the
            ' column's cell with the delimited value.
            ' If there was only one entry move to the next row.
            If vCellEntryCount = 0 Then

                ThisRow = ThisRow + 1

            Else ' else loop through the entries adding a new row

                For vCellCount = 0 To vCellEntryCount

                    ' Trim removes any space left after the delimiter
                    vCellEntry = Trim(vCellData(vCellCount))
                    ' If this is not the last entry insert a copy of the current
                    ' line for the next entry
                    If vCellCount < vCellEntryCount Then

                        .Rows(ThisRow).Copy
                        .Rows(ThisRow).Insert Shift:=xlDown

                    End If

                    ' Update the current line of the multi-entry cell being
                    ' processed with the current entry
                    .Cells(ThisRow, CrntCol) = vCellEntry
                    ' Move to the next row
                    ThisRow = ThisRow + 1

                Next vCellCount

            End If
        Loop

    End With
End Sub
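
As a quick illustration of what the procedure does (the data below is made up), assume the active cell sits on the first populated cell of the Colours column:

Owner   Colours
Smith   red, green, blue
Jones   yellow

After running TxtToRows the multi-entry cell has been split into its own rows, with the Owner value copied onto each new row:

Owner   Colours
Smith   red
Smith   green
Smith   blue
Jones   yellow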

 

Fill Down Copy

If you have data in one cell that is either merged through rows or entered in a single cell with the subsequent rows left blank, you will not be able to group by this column until each record has the corresponding data in each cell of the column.  You could page down, select each populated cell and double-click the little corner box to perform a fill-down copy, or you can run a procedure.  The procedure below may be run to copy data down into unpopulated cells.

Sub FillDownCopy()

' Procedure FillDownCopy
' Author : Devin Adint
' Purpose: This procedure moves down a column copying data down
'          into unpopulated cells.

    ' Stop once the last populated cell in the column has been reached
    While Selection.End(xlDown).Row < ActiveSheet.Rows.Count

        If ActiveCell.Offset(1, 0).Value2 = "" Then

            ' Copy the populated cell, paste it over the run of blank cells
            ' beneath it, then jump to the next populated cell
            Selection.Copy
            ActiveCell.Offset(1, 0).Select
            Selection.End(xlDown).Select
            ActiveCell.Offset(-1, 0).Select
            Range(Selection, Selection.End(xlUp)).Select
            ActiveSheet.Paste
            Application.CutCopyMode = False
            Selection.End(xlDown).Select

        Else

            ' The next cell is already populated; move down one row
            ActiveCell.Offset(1, 0).Select

        End If

    Wend
End Sub
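
For example (again with made-up data), start with the active cell on the first Region value:

Region   Sales
East     100
         200
West     300
         400

Running FillDownCopy copies each populated Region value down through the blank cells beneath it, so every record carries its region:

Region   Sales
East     100
East     200
West     300
West     400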

Optimizing Disk IO Through Abstraction

To Engineer or Not To…

When disk capacity is released to a new application or service, projects often do not consider how best to use the storage that has been provided. Essentially the approaches fall into one of two schools of thought. The first is to reduce upfront engineering to a couple of design options and resolve issues as they arise. The second is to engineer several solution sets with variable parameters, providing a broader palette of solutions and policies from which an appropriate solution may be selected.

Reduced Simplified Engineering

  • Apply one of a couple infrastructure designs to a project.
  • This approach involves less work upfront, has a simpler execution and involves less work gathering requirements.
  • Potentially more time and effort will be spent resolving issues when resources and design are insufficient.

Solution Engineering

  • Develop several standard solution sets and models and document policies and procedures which will be used to fit applications/components into these models.
  • This approach involves substantial requirements gathering and entails more work and complexity.
  • The result is a more efficient use of resources and less time spent resolving issues because the design and resources should better fit the system needs.

Engineering Disk Performance

When it comes to disk performance, the cost of not considering how to optimize the use of the storage infrastructure can be very significant. Consider the IO operation cycle time table below. The difference between CPU operations, memory operations and device IO spans several orders of magnitude. If a CPU cycle is scaled up to one second, the comparison becomes one between seconds for CPU and memory operations and months to years for device IO. Poorly orchestrated device use can therefore have a significant impact: a memory miss costs only a few hundred nanoseconds, but a missed disk IO cycle can cost a couple dozen milliseconds. In the scaled perspective, that is the difference between about four minutes and about eight months.

 

Device         Cycle Time        Cycle in Seconds   Scaled Cycle
CPU            1 nanosecond      1.00×10^-9         1 second
CPU Register   5 nanoseconds     5.00×10^-9         5 seconds
Memory         100 nanoseconds   1.00×10^-7         2 minutes
Disk           10 milliseconds   1.00×10^-2         4 months
NFS Op         50 milliseconds   5.00×10^-2         2 years
Tape           10 seconds        1.00×10^1          3 centuries

 

IO Op Cycle Time Scales
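
To see where the scaled figures come from, divide each cycle time by the 1 nanosecond CPU cycle and read the result as seconds: a 10 millisecond disk IO is 10^-2 / 10^-9 = 10^7 CPU cycles, and 10^7 seconds is roughly 116 days, or about four months; the 50 millisecond NFS operation scales to 5×10^7 seconds, a little over a year and a half, and a 10 second tape operation scales to 10^10 seconds, on the order of three centuries.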

How IO devices are used by systems can impact device performance in several ways. First, between the system and the device there is usually a cache which attempts to predict what data will be needed and pre-stages it in faster memory in anticipation of potential access requests. Traditional cache algorithms identify data that is accessed in ordered proximity on the IO device and read ahead to pre-fetch subsequent data. These pages are then left in cache until they are aged out by subsequent access to other data on the device. The result is that sequential IOs experience better performance because the data will already be in cache, while random IOs and IOs that occur sporadically require direct access to the slower storage device. Newer caching algorithms such as SARC are designed to better anticipate random IO, but they are no substitute for considering data layout on the devices. An example of incorrect data layout would be overly aggressive server-side striping using volume management, which can cause IOs that are sequential on the system to appear random on the storage device, in essence circumventing caching.

The second area of consideration is the back-end performance of the storage device. Back-end performance can be impacted by IO sizes and by distributions across array boundaries. The most significant impact comes from conflicting IOs that request data from different portions of the storage device media at the same time. The result is what is termed “thrashing”, because the device access mechanism must continuously be repositioned from one part of the media to another, increasing access times.

Optimizing the Layout of Information on Storage

Component Activity Diagram

With all these potential factors that can impact storage performance, how can a method for optimizing the use of storage devices be formulated? Some of the decisions have to be made on the back end by the storage analysts who configure and present the storage to hosts. Once storage is presented to a host, how it is used in light of these factors needs to be considered. Back-end configuration considerations vary by storage device implementation, but most storage sub-systems have facilities that automate redistribution of data to balance IOs across the device media, and directions on optimum logical device configuration are available in architecture documentation and from field engineers. Once these configured devices are made available to a system, however, a method of laying out data needs to be considered that will increase sequential IO and reduce the potential for IO conflicts. Developing a standard method, rather than creating unique custom solutions for each application component, is more advantageous because it establishes a more predictable, re-creatable and supportable environment. This method should consider three factors: what the component does, what the component stores, and how large an implementation the component is supporting.

Categorizing Component Functionality

Application Functional Categories

The first step is to determine what an application component does. Some applications are a single entity, while others are non-monolithic and break the work they perform down into subsets of functionality performed by components. All application or component functionality can be categorized into one or more of four possible categories:

  • Client servers – web servers, web portals, portal servers, SNA gateways and Citrix servers. This category is mostly comprised of executables and static configurations or models, with some temporary file requirements.
  • Utility servers – provide peripheral or supporting functionality such as reporting and authentication, and include LDAP servers, Active Directory servers, messaging servers, certificate servers, file transfer servers, EDI and OLAP servers. This category has a broad constituency including executables along with configuration files, temporary file space and data space for certificates and directory structures; IO profiles can range from sporadic to rather intense in the case of OLAP servers.
  • Application servers – provide a framework for software development through distributed objects and APIs that deliver business logic and business process functionality, and include IBM WebSphere, GlassFish, DataStage TX, BEA Tuxedo and AX Workflow Manager. This category generally has executables, configuration files and some temporary space, but interfaces with data servers for any major data functions.
  • Data servers – manage access to structured, semi-structured and unstructured data. Data servers include databases such as Oracle and DB2 and document management applications such as Documentum. This category generally includes executables, configuration files and large data stores managed by a data engine, and tends to perform the brunt of data manipulation.

Categorizing components using these definitions should help to narrow down storage infrastructure performance requirements.

Categorizing Storage IO Activity

The next step is to determine what types of data are stored and how they are accessed by a component. The types of data that an application component stores and accesses fall into one of four categories based upon how the data is used and accessed; see the diagram of the categories below.

Types of Storage Access

  • Static data – executables, configuration files, models, static images, templates, style sheets, etc. This data is usually loaded once when a process starts, and much of what is needed is instantiated in system memory as a running process or cached as frequently used templates and image files, so IO happens at start-up and then less frequently thereafter.
  • Transient data – temporary files, database transaction logs, database redo logs, spools, queues, temporary database tables, etc. This data is usually accessed only a few times and then either purged or over-written in a cyclical file. The IO profile is more random and sporadic and, in the case of data servers, usually works in concert with access to retained active data.
  • Active data – RDBMS data, HDMS data, file shares, document or image stores, etc. This is the retained working data set used by an application component and typically involves heavier sequential IO operations.
  • Reference data – RDBMS indexes, HDMS directory data and metadata. Reference data is generally accessed in concert with retained active data to identify, based upon criteria stored in the reference data, what active data needs to be accessed.

Categorizing the file systems and directories specified by a component into one of these categories will help to identify a data layout that isolates IO to reduce contention and consolidates similar IO patterns, thereby increasing the potential for sequential IO.

Determining a Size Model

Model    Description
Small    50 – 256 GB, or up to 1200 IO/s
Medium   256 – 750 GB, or up to 1800 IO/s
Large    750 – 5000 GB, or up to 3600 IO/s
Huge     > 5000 GB, or > 3600 IO/s

Now that we know what the application component does and what types of data files it uses, the next step is to determine how large an implementation the component is supporting. The sizing should be based upon the size of the file spaces and/or the estimated number of IOs. The resulting model is then used to determine whether the file spaces need to be separated, and to what extent, based upon the storage activity category.

Using OS Storage Abstractions to Optimize Layout and Use

Finally, now that much has been learned about the component being implemented and the file space and implementation size have been determined, what mechanism may be used to manage IOs? Most operating systems have facilities for creating logical abstractions of storage. The diagram of OS storage abstraction below depicts a typical volume management facility. This is true of HP LVM, AIX Volume Manager and Veritas Volume Manager, and is even true of Solaris ZFS zpools and file systems.

OS Storage Abstraction Functionality

These abstraction structures provide the means to control storage access in three ways. Device groups (volume groups, or labeled disks on Windows) provide a means of isolating IO: logical partitions are distributed across the presented disks based upon several parameters, but all of their IO is confined to the devices within the group. Logical partitions and file systems limit data growth, and directories provide a logical segregation of files into hierarchies for organizational purposes. Once an implementation size is determined, a model may be assigned. In a small model all the storage activity categories may be handled by a single device group. In a medium model the reference and transient data should be placed in one device group and the static and active data in another. In a large model static and reference data are placed in one device group, transient data in a second and active data in a third, with file systems created per specifications. Finally, if a component is determined to have a huge IO or size requirement, four or even more device groups should be considered, including multiple partitioning of the transient, active and reference data.

Division of the volume groups may be tackled in several ways. One way would be to track all device group names and their purposes, adding new device groups based upon the new application components and the service or application they support. This becomes more complex in shared-service environments, where one system may host several instances of a component. An easier way to accomplish this is to use a device group naming convention. With a naming scheme, analysts don’t need to know what device groups already exist, how storage is used on a system can be seen at a glance from the device group names, and unnecessary volume groups are not created, since any new file systems that fall into pre-existing categories end up assigned to the appropriate pre-existing device group. Essentially the device group names in use are self-tracking, since applying the naming process will yield either the same name or a new name when applicable. The naming scheme can be automated in a spreadsheet form using lookups based upon the models and the categorizations of the file systems requested in the form. The naming scheme I have used has the following form:

  • [app abbreviation, up to 4 characters][instance number][storage activity category (STAR)][environment code]
  • Examples: Oracle PeopleSoft data tables, Production = orps1ap
    Oracle PeopleSoft index tables, Production = orps1rp
Component     Code     Storage Access   Code     Environment           Code
Client        C        Static           S        Development           D
Utility       U        Transient        T        Integration Test      I
Application   A        Active           A        Acceptance Test       A
Data          D        Reference        R        Quality Assurance     Q
                                                 Production            P
                                                 Disaster Recovery     R
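
As a rough sketch of how the lookup could be automated (for instance behind the spreadsheet form mentioned above), a small VBA function along these lines could assemble the name. The argument names and the Select Case mappings below simply restate the code tables above and are illustrative, not part of the original worksheet:

Function DeviceGroupName(AppAbbrev As String, Instance As Integer, _
                         StorageCategory As String, EnvName As String) As String

    Dim sCatCode As String
    Dim sEnvCode As String

    ' Storage activity category (STAR) code
    Select Case UCase(StorageCategory)
        Case "STATIC": sCatCode = "s"
        Case "TRANSIENT": sCatCode = "t"
        Case "ACTIVE": sCatCode = "a"
        Case "REFERENCE": sCatCode = "r"
    End Select

    ' Environment code
    Select Case UCase(EnvName)
        Case "DEVELOPMENT": sEnvCode = "d"
        Case "INTEGRATION TEST": sEnvCode = "i"
        Case "ACCEPTANCE TEST": sEnvCode = "a"
        Case "QUALITY ASSURANCE": sEnvCode = "q"
        Case "PRODUCTION": sEnvCode = "p"
        Case "DISASTER RECOVERY": sEnvCode = "r"
    End Select

    ' [app abbreviation up to 4 characters][instance number][category][environment]
    DeviceGroupName = LCase(Left(AppAbbrev, 4)) & Instance & sCatCode & sEnvCode

End Function

Called as DeviceGroupName("orps", 1, "Active", "Production"), the function would return orps1ap, matching the PeopleSoft example above.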

Conclusion

Exploring the structures that comprise an application component and then making informed judgments based upon the size of an implementation will result in better performance and a more efficient use of resources. The isolation of IOs by the type of data access will place like data together increasing sequential access potential and decreasing the potential for IO conflicts due to concerted access between active, transient and reference data types.

Below is a link to a worksheet I created a while ago to determine volume group names based upon data types.

VG Storage Layout Excel Worksheet

Storage Capacity KPIs

For many years when I first started working with distributed storage, I worked with Asset Management and various other departments to answer the question, “How much storage do we have available and how much is used?”  The problem was that, depending upon how the numbers were summarized and presented, various impressions were left with management that did not communicate a complete picture.  This invariably led to inaccurate assumptions that required many subsequent explanations.  If we are to overcome these problems and communicate a clear picture of storage capacity we must address several issues.

The first issue to be addressed is whether to report storage capacities as raw capacity or usable capacity.  The simplest method is to report raw capacity, but since these numbers do not take protection and management overhead into account, service delivery management is tempted to think that more storage is available than actually is.  If these overheads aren’t taken into account when projecting future demand, the projected supply may be overstated.  For this reason it is probably best to provide facilities for reporting both raw numbers, which are used more in day-to-day support, and usable numbers for planning and estimation purposes.
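
To make the difference concrete (the figures here are purely illustrative): 100 TB of raw disk configured as 7+1 RAID 5 arrays leaves roughly 87.5 TB after parity, and hot spares, replication copies, snapshot reserves and file system overhead reduce the usable figure further.  Reporting 100 TB to planners when perhaps 80 TB can actually be allocated is exactly the kind of gap that leads to overstated supply projections.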


Pushing Your Profile and SSH Keys

Whenever you start supporting a new environment, especially in a large corporation, you are usually confronted with many systems.  Security will take care of setting up your access across whatever platforms there may be, but generally you are left holding the bag when it comes to setting up your SSH keys and any profile customizations, not to mention distributing any scripts or tools you have come to rely upon.  Of course, before you put any tools on a system there are several things to consider.  You definitely want to consider which environments you perform the distributions on first; it is always good to start with development or lab environments and move out from there.  You will also need to consider the corporate policies related to the environment, which might limit your ability to even have your own set of tools and scripts; you may be limited to simple .profile changes and SSH keys.  Implementing a script to push these keys and profiles out may need to go through various degrees of red tape.  Whatever policies and requirements exist in your organization, it is your responsibility to know them and to determine how or if the tools discussed here may be used.
