Tickets against complexity

SCL's Dusan Vudragovic participated in the EGI ROD Team Workshop as a representative of the Serbian National Grid Initiative, AEGIS. The workshop was held at the Science Park in Amsterdam on 1 and 2 June 2010, and its main goal was to present new operations procedures and tools to NGI representatives. The current set of procedures and operational documents from EGEE is being gradually adjusted to the EGI framework and the new NGI-based Grid operations model. National and regional Operator on Duty (ROD) teams oversee the EGI infrastructure following common policy documents and procedures, and track alarms from different monitoring tools. National ROD teams follow up on identified operational problems under their jurisdiction and coordinate the work of Grid site managers in resolving all issues. This is done by opening tickets through the ROD dashboard, where highly skilled Grid site administrators with extensive experience from the EGEE and SEE-GRID series of projects deal with day-to-day Grid operations problems.

An efficient yet lightweight set of procedures and tools is an important requirement for coping with the extreme complexity of the EGI infrastructure. To illustrate this, we briefly mention the bottom-line numbers: 59 participating countries, 330 Grid sites, more than 100,000 CPUs and 100 PB of data storage, thousands of users, and more than 10,000 concurrently running jobs. The EGI infrastructure was previously organized in 11 Regional Operations Centers (ROCs), and it is now being restructured into a federation of more than 40 National Grid Infrastructures (NGIs). Each NGI is established at the country level, gathers all national Grid sites under one umbrella, and manages their operation and interoperation within EGI. NGIs differ in many ways: in the number of sites, CPUs, data storage capacity, and hardware architectures, as well as in their funding and organizational schemes. Consequently, an efficient set of procedures and policy documents is essential to provide a unified approach to managing operations (operational support, infrastructure monitoring) through the processing of trouble tickets.