3 traditional problems of storage operations and a non-traditional solution
If you are managing a storage operations team, you have probably already met the following 3 issues:
no way to limit admin access rights for less skilled team members
typos in the naming convention causing non-functional reports
non-unified configuration of disk arrays
In most companies nobody is solving these problems: there is no time, no willingness, and often a belief that the solution is extremely expensive.
But the solution is relatively simple and surprisingly cheap.
Allow storage configuration by an admin without full training?
The current practice of weak or even non-existent granularity of admin rights increases the cost of building your storage team.
I cannot give a junior admin access to the disk array because he hasn't passed all the storage trainings. What if he cuts off LUNs from a live server?
You know this situation. You manage the storage department. You have a few junior team members. The older, more experienced ones work on problems, design and planning. You would like to use the juniors for common tasks like “the customer needs 100 GB of disk immediately”. And the customer needs it immediately, often at night or during the weekend.
Disk array vendors simply ignore the granularity of access rights.
What's the problem? Disk arrays and FC switches usually allow you to define users with different privilege levels. The privilege granularity is, however, miserable: one level has read-only access, another full admin access.
On some platforms it is possible to define user roles and explicitly list the commands allowed for a particular role. The problem is the syntax of the commands.
If the platform allowed syntax like this:
create lun | share | zone | interface | …
modify lun | share | zone | interface | …
delete lun | share | zone | interface | …
the solution would be simple. The less skilled operator would have access rights only to the “create” command. In the worst case he creates a LUN of the wrong size and maps it to the wrong server. The privileges for the “modify” and “delete” commands would be assigned only to more skilled users.
But in my experience, all vendors structure the commands like this:
lun create | modify | delete
share create | modify | delete
zone create | modify | delete
So if a user has access rights to the “lun” command, he can do anything with LUNs.
And I'm talking here about platforms that allow fine-tuning of user roles. As I mentioned above, it's mostly just read-only or full admin access.
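One way around the vendor's noun-first grammar is a thin wrapper that parses each command line and authorizes on the verb instead. A minimal sketch in Python; the role table and command strings are illustrative, not any vendor's real CLI:

```python
# Hypothetical verb-level authorization for a noun-first CLI.
# Role names and command strings are illustrative, not real vendor syntax.

ROLE_VERBS = {
    "junior": {"create"},                       # may only create new objects
    "senior": {"create", "modify", "delete"},   # full set of verbs
}

def authorize(role, command_line):
    """Authorize a command like 'lun create -size 100g' by its verb."""
    parts = command_line.split()
    if len(parts) < 2:
        raise ValueError("expected '<object> <verb> ...'")
    obj, verb = parts[0], parts[1]
    if verb not in ROLE_VERBS.get(role, set()):
        raise PermissionError(f"role '{role}' may not '{verb}' a {obj}")
    return True

authorize("junior", "lun create -size 100g")    # allowed
# authorize("junior", "lun delete -name db01") # raises PermissionError
```

The array still sees one full-admin account; the verb-level check happens in the wrapper before any command is forwarded.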
A small, highly qualified team, or a lot of people lacking training?
There are two ways to handle it.
You have a small team of specialists responsible for complete storage administration. All of them are skilled and fully trained.
And extremely overloaded.
Even small tasks require the time of a highly qualified specialist.
Or you have a big department, split into “operations” handling routine tasks and customer requests, and a “backline”. As all of them have full admin rights to all storage platforms, you have to train all of them on every type of disk array and FC switch.
It's expensive, but an outage of customer services caused by an unqualified action is even more expensive.
There is no other way.
But wait, there is another way! Just read on.
A typo causes invalid capacity reports?
A few typos invalidate your reports and prevent the right decisions.
The capacity reports don't match because we sometimes have a typo in the group names, so they don't match the server names.
Expensive reporting software “damaged” by a few typos.
You use high-quality and expensive reporting software. But you cannot fully trust the reports because you know there are some errors in them.
A typical example: the report of assigned capacity per server is based on the sum of LUN sizes in the hostgroup whose name matches the server name (or whose name contains the server name).
If an admin makes a mistake in the hostgroup name, the report for that server will be empty.
With a few servers you will notice the mistakes, but when the server count goes into the thousands, they remain undiscovered. And if you do find some of them, the corrections must be handled as planned changes, because they touch the production environment.
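A toy version of the join the reporting tool performs makes the failure concrete; all names here are invented for illustration:

```python
# Toy capacity report: sum LUN sizes per hostgroup, joined on server name.
# A single typo in a hostgroup name ('db-serv01' vs 'db-srv01') leaves the
# server's row empty -- exactly the failure described above.

luns = [
    {"hostgroup": "web-srv01", "size_gb": 100},
    {"hostgroup": "web-srv01", "size_gb": 200},
    {"hostgroup": "db-serv01", "size_gb": 500},  # typo: should be 'db-srv01'
]
servers = ["web-srv01", "db-srv01"]

report = {s: sum(l["size_gb"] for l in luns if l["hostgroup"] == s)
          for s in servers}

print(report)  # {'web-srv01': 300, 'db-srv01': 0} -- db-srv01 looks empty
```

The 500 GB assigned to “db-serv01” is not lost on the array, it just disappears from the report, which is why nobody notices until a capacity decision goes wrong.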
A system that doesn't allow typos?
Imagine a system that doesn't let you type wrong names. You cannot even select a live server by mistake.
The LUN size is selected from a pre-defined list if your procedures define only particular LUN sizes.
Our magic system searches the list of open tickets in your team's queue in Remedy/Siebel/ServiceDesk, takes the list of servers waiting for new capacity, and your admin just chooses one from a pull-down menu.
Such a magical system can even log in to the server where you want to assign disks, scan the WWNs of its FC interfaces and create FC zones with the right initiators.
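The input-validation part of such a system can be sketched in a few lines: a static list of allowed LUN sizes and a dynamic server list that would come from the ticket queue. The ticketing lookup here is a placeholder, not a real Remedy/Siebel/ServiceDesk API:

```python
# Sketch of typo-proof input: the operator can only pick values that exist.
# fetch_waiting_servers() stands in for a real ticketing-system query.

ALLOWED_LUN_SIZES_GB = [50, 100, 250, 500]   # static list from your procedures

def fetch_waiting_servers():
    """Placeholder for a Remedy/Siebel/ServiceDesk queue query."""
    return ["web-srv01", "db-srv02"]

def choose(prompt, options):
    """Accept only a value from the pre-defined list -- no free typing."""
    print(f"{prompt}: {options}")
    value = options[0]        # in a real tool this is a pull-down selection
    if value not in options:
        raise ValueError(f"{value!r} is not an allowed option")
    return value

server = choose("Server waiting for capacity", fetch_waiting_servers())
size = choose("LUN size (GB)", ALLOWED_LUN_SIZES_GB)
print(f"Would create a {size} GB LUN for {server}")
```

Because the operator never types a name, the hostgroup/server mismatch from the previous section simply cannot happen.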
The real cost of non-unified configurations appears only after something important crashes.
The upgrade is complicated because every node of the disk array cluster has a different configuration.
The fail-over failed because the other node doesn't have all services enabled.
Are you auditing your configurations?
It has happened many times in my experience: I joined a new company, ran an audit of the production storage systems and found out that every one of them had a somewhat different configuration.
Even the nodes of an HA pair had different setups, so in the case of a disaster the fail-over would probably never happen.
The operations team had no willingness to correct it, as that means touching production systems. And it requires a lot of paperwork. (OK, a lot of typing on the keyboard.)
I can read your mind now: you have an exactly documented procedure for configuring every new disk array and FC switch. It cannot happen in your environment.
But it happens. An admin skips some command by mistake. Or a vendor specialist changes some parameter during troubleshooting and nobody propagates the change to the other nodes.
And when you detect the misconfiguration, you don't know who did it, when and why. Was it a mistake or on purpose?
Auditing is usually disabled on disk arrays and switches because it generates huge log files.
How to handle planned configuration changes?
We have our magical system. It sends the right set of configuration commands to every newly built box. Then you are sure that you really have a unified environment.
What about subsequent changes? Either planned ones, or ad-hoc changes made by the vendor during troubleshooting?
I suggest two different approaches:
In the first one, you make all configuration changes through your “magical” tool.
You can change the configuration of several devices at the same time. The system tracks who made the change and when, and you can even attach a reference to a Remedy/Siebel/ServiceDesk ticket or a text note.
Such a system is not suitable when a vendor specialist sits directly at the console and just tries a lot of commands, hoping that one of them will solve the problem.
Then the second approach is more suitable:
The configuration of all systems is uploaded to a central repository in regular intervals.
Every change against the previous state generates incident in your ticketing system (Remedy/Siebel/ServiceDesk/…).
The incident is handled as an unapproved change, and the configuration has to be reverted to the previous state.
Or the change is confirmed with the right reason and a responsible person.
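The snapshot-and-diff approach can be sketched in a few lines. The incident call below is a placeholder for your real ticketing integration, and the config dictionaries stand in for parsed device configuration:

```python
# Sketch of configuration-drift detection: compare the current config
# snapshot against the last known-good baseline and open an incident on
# any difference. open_incident() is a placeholder for your ticketing API.

def open_incident(device, changes):
    """Placeholder for creating a Remedy/Siebel/ServiceDesk incident."""
    print(f"INCIDENT: unapproved change on {device}: {changes}")

def detect_drift(device, baseline, current):
    """Return changed keys as {key: (old, new)} and raise an incident."""
    changes = {k: (baseline.get(k), current.get(k))
               for k in set(baseline) | set(current)
               if baseline.get(k) != current.get(k)}
    if changes:
        open_incident(device, changes)
    return changes

baseline = {"ntp": "10.0.0.1", "failover": "enabled"}
current  = {"ntp": "10.0.0.1", "failover": "disabled"}  # someone touched it
print(detect_drift("array-node-2", baseline, current))
```

Run this on a schedule against every node and the “who, when and why” question turns into a ticket with a timestamp instead of a guessing game.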
Currently I don't know of any pre-made system like this, so I've been trying to create one myself.
If you know of something already made, let me know.
If I see interest in such a system, it will speed up the development.
Our requirements for the “magic” system
You need some magical system that will save you. That's clear. What should it be capable of?
Let's try to put together the requirements:
allows you to execute only exactly defined commands with exactly defined parameters (this prevents deleting or changing existing configured objects by mistake)
where possible, allows selecting parameters from a list (to avoid typos)
the list can be static (like sizes of LUNs)
the list can be dynamic (list of servers waiting for initial configuration)
the system must check inputs and suggest the right naming convention
the system must not allow executing commands not authorized for the particular user
the system must be able to collect data from other sources
the system must be able to manage authorization at the user-group level
the system must track the execution of every procedure, recording date/time, username and return code
The following options can be considered “nice to have”:
authentication via Active Directory or LDAP
the possibility to define basic functions that can be joined into complex “workflows”
the possibility of conditional operation execution (e.g. “if it doesn't exist, create it”)
visualization of the whole procedure
a web interface (no need to install an OS-dependent application)
the possibility to design input forms
possibility to define approval points inside the procedure
(I will probably update this list over time. Your tips are welcome.)
Is it possible to build such a system?
But it already exists!
Automation, or workflow
Everybody is speaking about it but nobody is using it.
(Google AdWords claims that “storage automation” is searched only 70 times a month, and searches for “storage workflow” are under Google's radar.)
Automation of creating virtual servers, Oracle databases, ... Nothing unusual.
But automation of storage capacity provisioning? Nobody wants to do it.
But don't worry.
You have documented every procedure, every naming convention.
The easy way: wrap existing procedures in scripts
The simple, cheap but effective way is a few well-written scripts.
If you use console commands for every task, it's a piece of cake.
Just put them together in the right order and wrap them in some scripting language.
Yes, it's that simple.
(If you click through a management GUI, you first have to convert all the actions to their command-line equivalents.)
The user does not log in to the disk array or FC switch. The user logs in to your server, where she/he is authorized to run only particular scripts.
I know what you're thinking now. If the user runs a script with commands accessing the disk array, the user can access the array directly, can't he?
It's about trust. If there is a rule of “no direct access to the array” and users use the scripts every time, it will work.
If you are paranoid (like me), there is a way to separate the effective user running the script from the effective user running commands on the disk array. Contact me and I will explain it to you.
The more complicated, but complete way
When scripts are not enough, use a pre-made workflow system.
In the requirements list above there are some items not easily implemented with scripts. They require a higher level of programming, and in the end you would be working on complete custom software.
To save your time, I have one ready solution for you.
Stop, stop, stop! Don’t worry.
I don’t want to sell you expensive software just to get a commission.
I offer you free software!
It's software developed by a commercial company that has been well known in the storage area for many years, but it's free. Strange, I know.
The company designed it for use with its own storage, but the beauty is that you can control almost anything with it.
NetApp OnCommand WorkFlow Automation (WFA)
A good solution for a good price? What, even free of charge?
According to the official documentation, NetApp WFA offers:
A Designer portal to create workflows. It contains building blocks (commands, search engines, filters, functions) to build a complete workflow and allows special options like automatic resource selection based on defined criteria, loops and approval points.
It can collect data from external sources and use them during workflow execution.
An Execution portal to run particular workflows, check return codes and check log files.
An Administration portal to configure WFA itself and to manage users and privileges.
A web services interface that allows calling existing workflows from external systems via a REST API.
A Storage Automation Store that allows downloading pre-made workflows, either from NetApp or from the wide user community.
The system can run on Windows or Linux server.
Commands can be written in PowerShell (Windows) or Perl (Windows, Linux). Linux fans have to live with the fact that most of the workflow packages are written in PowerShell.
But if you plan to automate non-NetApp systems, you need to write your own scripts anyway.
The software is free and doesn't require any license key. Just create an account on the NetApp support web page (http://support.netapp.com) and download the installation package.
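To give an idea of what calling a workflow from an external system looks like, here is a sketch that builds such a REST request in Python. The endpoint path and XML body shape are illustrative only; check the REST documentation of your WFA version for the exact layout before using anything like this:

```python
# Sketch of triggering a WFA workflow over its REST interface. The endpoint
# path, XML body and credentials are illustrative placeholders -- consult
# your WFA version's REST documentation for the real request layout.

import base64
import urllib.request

def build_execute_request(host, workflow_id, inputs, user, password):
    """Build (but do not send) the HTTP request to start one workflow job."""
    url = f"https://{host}/rest/workflows/{workflow_id}/jobs"  # assumed path
    body = "<workflowInput><userInputValues>"
    for key, value in inputs.items():
        body += f'<userInputEntry key="{key}" value="{value}"/>'
    body += "</userInputValues></workflowInput>"
    req = urllib.request.Request(url, data=body.encode(), method="POST")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/xml")
    return req  # urllib.request.urlopen(req) would actually submit it

req = build_execute_request("wfa.example.com", "42", {"size_gb": "100"},
                            "admin", "secret")
print(req.full_url)
```

This is what makes the portal integration interesting: your ticketing system can fire a provisioning workflow the moment a request is approved, with no human typing at all.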
If there is interest, I can prepare a video with a simple workflow implementation.
Just contact me at Radovan.Turan@radovanturan.com and I will notify you when it's ready.
(Higher interest will motivate me to do it faster.)