Unknown Issues Causing Problems with Network Drives

Several CHE and ISE network drives may currently be inaccessible due to issues with the hosting infrastructure at Enterprise Infrastructure & Operations (EIO). EIO is aware of the issue and is working on it. I do not yet have an estimate of when services will be restored. The issue affects several units on campus whose storage is hosted on the affected cluster.

If it looks like the downtime will be extended, I will send additional information as it becomes available. Otherwise, expect my next email once I have confirmed that everything is back online.

If you have questions or comments, please call 392-9217 or email mis@eng.ufl.edu.

UPDATE1 (@4PM 2015-04-10): As of 4:00 PM, services are restored.

UPDATE2 (@9AM 2015-04-13): UFIT has released a statement regarding Friday’s downtime.

On Friday afternoon, April 10, UFIT worked an incident that had widespread impact, so I’d like to give you the details:

At about 3:15 PM Friday, the UF Computing Help Desk began receiving calls about MediaSite being unavailable. Within a few minutes it was apparent that the problem was major, so the Help Desk notified UFIT’s Video Services and Operations groups.

The Operations group contacted the appropriate personnel, who quickly realized that the problem was with the Isilon storage system. Specifically, the Master Control Process (MCP) on the cluster was hung and consuming 100% of CPU capacity. UFIT staff immediately engaged EMC technical support so they might see the problem “live.” At the suggestion of a UFIT sysadmin, EMC killed that process at roughly 3:30 PM, at which point the Isilon storage server began functioning normally.

The outage lasted about 15 minutes, from 3:15 PM to 3:30 PM.

It should be noted that other services were affected, including UFirst and Network Shared Drives, though only MediaSite problems were reported to the Help Desk. UFIT and EMC are now investigating what caused the Master Control Process (MCP) to consume 100% of the CPU on the cluster. They also plan to implement appropriate monitoring in the short term and ultimately to resolve the problem (longer term, unknown duration).

If you have any questions or comments about this incident, please let me know.