3 traditional problems of storage operations and a non-traditional solution
If you are managing a storage operations team, you have probably already met the following 3 issues:
no way to limit admin access rights for less skilled team members
typos in the naming convention causing non-functional reports
non-unified configuration of disk arrays
In most companies nobody is solving these problems: there is no time, no willingness, and often a belief that the solution is extremely expensive.
But the solution is relatively simple and surprisingly cheap.
Allow storage configuration by an admin without full training?
The current practice of weak or even non-existent granularity of admin rights increases the cost of building your storage team.
I cannot give a junior admin access to the disk array because he hasn't passed all the storage trainings. What if he cuts off LUNs from a live server?
You know this situation. You manage the storage department. You have a few junior team members. The older, more experienced ones work on problems, design and planning. You would like to use the juniors for common tasks like “the customer needs 100 GB of disk immediately”. And the customer needs it immediately, often at night or during the weekend.
Disk array vendors simply ignore the granularity of access rights.
What's the problem? Disk arrays and FC switches usually allow you to define users with different privilege levels. The privilege granularity is, however, miserable: one level has read-only access, another full admin access.
On some platforms it is possible to define user roles and explicitly list the commands allowed for a particular role. The problem is the syntax of the commands.
If the platform allowed syntax like this:
create lun | share | zone | interface | …
modify lun | share | zone | interface | …
delete lun | share | zone | interface | …
the solution would be simple. The less skilled operator would have access rights only to the “create” command. In the worst case he creates a LUN of the wrong size and maps it to the wrong server. The privileges for the “modify” and “delete” commands would be assigned only to more skilled users.
But in my experience, all vendors structure the commands like this:
lun create | modify | delete
share create | modify | delete
zone create | modify | delete
So if a user has access rights to the “lun” command, he can do anything with LUNs.
And I'm talking here about platforms that allow fine-tuning of user roles. As I mentioned above, it's mostly just read-only or full admin access.
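One way around the vendor's noun-first grammar is a thin wrapper that parses each command line and authorizes on the verb instead. A minimal sketch in Python; the role table and command strings are illustrative, not any vendor's real CLI:

```python
# Hypothetical verb-level authorization for a noun-first CLI.
# Role names and command strings are illustrative, not real vendor syntax.

ROLE_VERBS = {
    "junior": {"create"},                       # may only create new objects
    "senior": {"create", "modify", "delete"},   # full set of verbs
}

def authorize(role, command_line):
    """Authorize a command like 'lun create -size 100g' by its verb."""
    parts = command_line.split()
    if len(parts) < 2:
        raise ValueError("expected '<object> <verb> ...'")
    obj, verb = parts[0], parts[1]
    if verb not in ROLE_VERBS.get(role, set()):
        raise PermissionError(f"role '{role}' may not '{verb}' a {obj}")
    return True

authorize("junior", "lun create -size 100g")    # allowed
# authorize("junior", "lun delete -name db01") # raises PermissionError
```

The array still sees one full-admin account; the verb-level check happens in the wrapper before any command is forwarded.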
A small, highly qualified team, or a lot of people lacking training?
There are two ways to handle it.
You have a small team of specialists responsible for complete storage administration. All of them are skilled and fully trained.
And extremely overloaded.
Even small tasks require the time of a highly qualified specialist.
Or you have a big department, split into “operations” handling routine tasks and customer requests, and a “backline”. As all of them have full admin rights to all storage platforms, you have to train all of them on every type of disk array and FC switch.
It's expensive, but an outage of customer services caused by an unqualified action is even more expensive.
There is no other way.
But wait, there is another way! Just read on.
A typo causes invalid capacity reports?
A few typos invalidate your reports and prevent the right decisions.
The capacity reports don't match because we sometimes have a typo in the group names, so they don't match the server names.
Expensive reporting software “damaged” by a few typos.
You use high-quality and expensive reporting software. But you cannot fully trust the reports because you know there are some errors in them.
A typical example: the report of assigned capacity per server is based on the sum of LUN sizes in the hostgroup whose name matches the server name (or whose name contains the server name).
If an admin makes a mistake in the hostgroup name, the report for that server will be empty.
With a few servers you will notice the mistakes, but when the server count goes into the thousands, they remain undiscovered. And if you do find some of them, the corrections must be handled as planned changes, because they touch the production environment.
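A toy version of the join the reporting tool performs makes the failure concrete; all names here are invented for illustration:

```python
# Toy capacity report: sum LUN sizes per hostgroup, joined on server name.
# A single typo in a hostgroup name ('db-serv01' vs 'db-srv01') leaves the
# server's row empty -- exactly the failure described above.

luns = [
    {"hostgroup": "web-srv01", "size_gb": 100},
    {"hostgroup": "web-srv01", "size_gb": 200},
    {"hostgroup": "db-serv01", "size_gb": 500},  # typo: should be 'db-srv01'
]
servers = ["web-srv01", "db-srv01"]

report = {s: sum(l["size_gb"] for l in luns if l["hostgroup"] == s)
          for s in servers}

print(report)  # {'web-srv01': 300, 'db-srv01': 0} -- db-srv01 looks empty
```

The 500 GB assigned to “db-serv01” is not lost on the array, it just disappears from the report, which is why nobody notices until a capacity decision goes wrong.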
A system that doesn't allow typos?
Imagine a system that doesn't let you type wrong names. You cannot even select a live server by mistake.
The LUN size is selected from a pre-defined list if your procedures define only particular LUN sizes.
Our magic system searches the list of open tickets in your team's queue in Remedy/Siebel/ServiceDesk, takes the list of servers waiting for new capacity, and your admin just chooses one from a pull-down menu.
Such a magical system can even log in to the server where you want to assign disks, scan the WWNs of its FC interfaces and create FC zones with the right initiators.
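The input-validation part of such a system can be sketched in a few lines: a static list of allowed LUN sizes and a dynamic server list that would come from the ticket queue. The ticketing lookup here is a placeholder, not a real Remedy/Siebel/ServiceDesk API:

```python
# Sketch of typo-proof input: the operator can only pick values that exist.
# fetch_waiting_servers() stands in for a real ticketing-system query.

ALLOWED_LUN_SIZES_GB = [50, 100, 250, 500]   # static list from your procedures

def fetch_waiting_servers():
    """Placeholder for a Remedy/Siebel/ServiceDesk queue query."""
    return ["web-srv01", "db-srv02"]

def choose(prompt, options):
    """Accept only a value from the pre-defined list -- no free typing."""
    print(f"{prompt}: {options}")
    value = options[0]        # in a real tool this is a pull-down selection
    if value not in options:
        raise ValueError(f"{value!r} is not an allowed option")
    return value

server = choose("Server waiting for capacity", fetch_waiting_servers())
size = choose("LUN size (GB)", ALLOWED_LUN_SIZES_GB)
print(f"Would create a {size} GB LUN for {server}")
```

Because the operator never types a name, the hostgroup/server mismatch from the previous section simply cannot happen.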
The real cost of non-unified configurations appears only after something important crashes.
The upgrade is complicated because every node of the disk array cluster has a different configuration.
The fail-over failed because the other node doesn't have all services enabled.
Are you auditing your configurations?
It has happened many times in my experience: I joined a new company, ran an audit of the production storage systems and found out that every one of them had a somewhat different configuration.
Even the nodes of an HA pair had different setups, so in the case of a disaster the fail-over would probably never happen.
The operations team had no willingness to correct it, as that means touching production systems. And it requires a lot of paperwork. (OK, a lot of typing on the keyboard.)
I can read your mind now: you have an exactly documented procedure for configuring every new disk array and FC switch. It cannot happen in your environment.
But it happens. An admin skips some command by mistake. Or a vendor specialist changes some parameter during troubleshooting and nobody propagates the change to the other nodes.
And when you detect the misconfiguration, you don't know who did it, when and why. Was it a mistake or on purpose?
Auditing is usually disabled on disk arrays and switches because it generates huge log files.
How to handle planned configuration changes?
We have our magical system. It sends the right set of configuration commands to every newly built box. Then you are sure that you really have a unified environment.
What about subsequent changes? Either planned ones, or ad-hoc changes made by the vendor during troubleshooting?
I suggest two different approaches:
In the first one, you make all configuration changes through your “magical” tool.
You can change the configuration of several devices at the same time. The system tracks who made the change and when, and you can even attach a reference to a Remedy/Siebel/ServiceDesk ticket or a text note.
Such a system is not suitable when a vendor specialist sits directly at the console and just tries a lot of commands, hoping that one of them will solve the problem.
Then the second approach is more suitable:
The configuration of all systems is uploaded to a central repository in regular intervals.
Every change against the previous state generates incident in your ticketing system (Remedy/Siebel/ServiceDesk/…).
The incident is handled as an unapproved change, and the configuration has to be reverted to the previous state.
Or the change is confirmed with the right reason and a responsible person.
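The snapshot-and-diff approach can be sketched in a few lines. The incident call below is a placeholder for your real ticketing integration, and the config dictionaries stand in for parsed device configuration:

```python
# Sketch of configuration-drift detection: compare the current config
# snapshot against the last known-good baseline and open an incident on
# any difference. open_incident() is a placeholder for your ticketing API.

def open_incident(device, changes):
    """Placeholder for creating a Remedy/Siebel/ServiceDesk incident."""
    print(f"INCIDENT: unapproved change on {device}: {changes}")

def detect_drift(device, baseline, current):
    """Return changed keys as {key: (old, new)} and raise an incident."""
    changes = {k: (baseline.get(k), current.get(k))
               for k in set(baseline) | set(current)
               if baseline.get(k) != current.get(k)}
    if changes:
        open_incident(device, changes)
    return changes

baseline = {"ntp": "10.0.0.1", "failover": "enabled"}
current  = {"ntp": "10.0.0.1", "failover": "disabled"}  # someone touched it
print(detect_drift("array-node-2", baseline, current))
```

Run this on a schedule against every node and the “who, when and why” question turns into a ticket with a timestamp instead of a guessing game.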
Currently I don't know of any pre-made system like this, so I've been trying to create one myself.
If you know of something already made, let me know.
If I see interest in such a system, it will speed up the development.
Our requirements for the “magic” system
You need some magical system that will save you. That's clear. What should it be capable of?
Let's try to put together the requirements:
allows you to execute only exactly defined commands with exactly defined parameters (this prevents deleting or changing existing configured objects by mistake)
where possible, allows selecting parameters from a list (to avoid typos)
the list can be static (like sizes of LUNs)
the list can be dynamic (list of servers waiting for initial configuration)
the system must check inputs and suggest the right naming convention
the system must not allow executing commands not authorized for the particular user
the system must be able to collect data from other sources
the system must be able to manage authorization at the user-group level
the system must track the execution of every procedure, recording date/time, username and return code
The following options can be considered “nice to have”:
authentication via Active Directory or LDAP
the possibility to define basic functions that can be joined into complex “workflows”
the possibility of conditional operation execution (e.g. “if it doesn't exist, create it”)
visualization of the whole procedure
a web interface (no need to install an OS-dependent application)
the possibility to design input forms
possibility to define approval points inside the procedure
(I will probably update this list over time. Your tips are welcome.)
Is it possible to build such a system?
But it already exists!
Automation, or workflow
Everybody is speaking about it but nobody is using it.
(Google AdWords claims that “storage automation” is searched only 70 times a month, and searches for “storage workflow” are under Google's radar.)
Automation of creating virtual servers, Oracle databases, ... Nothing unusual.
But automation of storage capacity provisioning? Nobody wants to do it.
But don't worry.
You have documented every procedure, every naming convention.
The easy way: wrap existing procedures in scripts
The simple, cheap but effective way is a few well-written scripts.
If you use console commands for every task, it's a piece of cake.
Just put them together in the right order and wrap them in some scripting language.
Yes, it's that simple.
(If you click through a management GUI, you first have to convert all the actions to their command-line equivalents.)
The user does not log in to the disk array or FC switch. The user logs in to your server, where she/he is authorized to run only particular scripts.
I know what you're thinking now. If the user runs a script with commands accessing the disk array, the user can access the array directly, can't he?
It's about trust. If there is a rule of “no direct access to the array” and users use the scripts every time, it will work.
If you are paranoid (like me), there is a way to separate the effective user running the script from the effective user running commands on the disk array. Contact me and I will explain it to you.
The more complicated, but complete way
When scripts are not enough, use a pre-made workflow system.
In the requirements list above there are some items not easily implemented with scripts. They require a higher level of programming, and in the end you would be working on complete custom software.
To save your time, I have one ready solution for you.
Stop, stop, stop! Don’t worry.
I don’t want to sell you expensive software just to get a commission.
I offer you free software!
It's software developed by a commercial company that has been well known in the storage area for many years, but it's free. Strange, I know.
The company designed it for use with its own storage, but the beauty is that you can control almost anything with it.
NetApp OnCommand WorkFlow Automation (WFA)
A good solution for a good price? What, even free of charge?
According to the official documentation, NetApp WFA offers:
A Designer portal to create workflows. It contains building blocks (commands, search engines, filters, functions) to build a complete workflow and allows special options like automatic resource selection based on defined criteria, loops and approval points.
It can collect data from external sources and use them during workflow execution.
An Execution portal to run particular workflows, check return codes and check log files.
An Administration portal to configure WFA itself and to manage users and privileges.
A web services interface that allows calling existing workflows from external systems via a REST API.
A Storage Automation Store that allows downloading pre-made workflows, either from NetApp or from the wide user community.
The system can run on Windows or Linux server.
Commands can be written in PowerShell (Windows) or Perl (Windows, Linux). Linux fans have to live with the fact that most of the workflow packages are written in PowerShell.
But if you plan to automate non-NetApp systems, you need to write your own scripts anyway.
The software is free and doesn't require any license key. Just create an account on the NetApp support web page (http://support.netapp.com) and download the installation package.
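To give an idea of what calling a workflow from an external system looks like, here is a sketch that builds such a REST request in Python. The endpoint path and XML body shape are illustrative only; check the REST documentation of your WFA version for the exact layout before using anything like this:

```python
# Sketch of triggering a WFA workflow over its REST interface. The endpoint
# path, XML body and credentials are illustrative placeholders -- consult
# your WFA version's REST documentation for the real request layout.

import base64
import urllib.request

def build_execute_request(host, workflow_id, inputs, user, password):
    """Build (but do not send) the HTTP request to start one workflow job."""
    url = f"https://{host}/rest/workflows/{workflow_id}/jobs"  # assumed path
    body = "<workflowInput><userInputValues>"
    for key, value in inputs.items():
        body += f'<userInputEntry key="{key}" value="{value}"/>'
    body += "</userInputValues></workflowInput>"
    req = urllib.request.Request(url, data=body.encode(), method="POST")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/xml")
    return req  # urllib.request.urlopen(req) would actually submit it

req = build_execute_request("wfa.example.com", "42", {"size_gb": "100"},
                            "admin", "secret")
print(req.full_url)
```

This is what makes the portal integration interesting: your ticketing system can fire a provisioning workflow the moment a request is approved, with no human typing at all.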
If there is interest, I can prepare a video with a simple workflow implementation.
Just contact me at Radovan.Turan@radovanturan.com and I will notify you when it's ready.
(Higher interest will motivate me to do it faster.)