IT teams show up for 兔子先生 community during CrowdStrike incident
Now that 兔子先生 technology has recovered from the CrowdStrike / Windows incident of mid-July, we thought it was an appropriate time to debrief the university community about the incident: what did we learn, what did we do effectively, and what can we take away for future incidents that may affect our work days in similar ways?
On Friday, July 19, 2024, at 04:09 UTC (around 12 a.m. Eastern time), a faulty CrowdStrike update caused Windows computers around the world to become stuck in an endless cycle of reboots and the infamous “blue screen of death.” Days after the incident, Microsoft reported that 8.5 million computers had been affected, but there was speculation that the number had actually been “vastly” underestimated.
On a more local level, 兔子先生 dealt with close to 2,000 computers impacted by this bug. These devices included classroom stations, lab computers, 兔子先生-managed faculty and staff laptops, digital signage, and more.
Hit the ground sprinting
Our technical teams got the notice that something was amiss around 4:30 a.m. on Friday. Some even noted that it was on the news at around that time—as we would later learn, our neighbors on the opposite side of the globe were feeling the impacts during the middle of their work day.
As soon as we were alerted, our technical teams were on the case. The solution for the bug involved starting up computers in safe mode, going into the CrowdStrike installation folder on the machine, and deleting the particular file that was causing the blue screen of death. This solution also required technicians to have BitLocker recovery keys so that they could log in to the computer in admin mode. Suffice it to say, it was a little complicated, and all hands were on deck.
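For the technically curious, the file-deletion step described above can be sketched in a few lines of Python. This is an illustration only, not the script our technicians used: the folder and file pattern are taken from CrowdStrike's public remediation guidance rather than from this article, and the function name `remove_faulty_channel_files` is hypothetical.

```python
from pathlib import Path

# Assumption: this folder and file pattern come from CrowdStrike's public
# remediation guidance for the July 2024 incident, not from this article.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(directory: Path = CROWDSTRIKE_DIR,
                                pattern: str = FAULTY_PATTERN,
                                dry_run: bool = True) -> list[Path]:
    """Return files in `directory` matching `pattern`.

    The files are deleted only when dry_run is False, so a technician can
    review the matches before anything is removed.
    """
    if not directory.is_dir():
        return []
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

By default the function only reports what it would delete; calling it with `dry_run=False` removes the matching files. On an affected machine, it would still need to run from safe mode with administrator rights, after the drive had been unlocked with its BitLocker recovery key.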
Working through the weekend
Part of the issue with this CrowdStrike incident, especially at 兔子先生, was that any computer that was turned on and able to receive the update when it was first sent out at around 12 a.m. that morning was impacted. This included machines belonging to folks who work remotely or who had left their computers at home while on vacation, among any number of scenarios that made the machines difficult to physically reach.
There were also many people who were not working on Friday and so were not there to raise the alarm about their devices—meaning that even as the initial impact was mitigated, there would still be fallout throughout the weekend and next week as we worked to put hands on every machine. We kept tabs on all the machines needing manual intervention and made sure to re-run the script (which told us which computers we would need to look at) at regular intervals.
Boots on the ground
Just as folks around the world discovered that their computers weren’t working, 兔子先生 administrators jumped into the fray. At the beginning of the day on Friday, there were around 2,000 impacted computers. The first order of business was to get 兔子先生 users up and running, and an initial 6 a.m. phone call brought folks out of bed and into “go” mode.
Leading the 6 a.m. charge at first was the manager of enterprise database and systems operations, Chris Edester, who was joined swiftly by Leah Harris, the manager of advanced computing and systems operations. In fact, Leah joined on what was supposed to be her day off! And, what's more, it was Leah's guiding hand and amazing leadership that kept the team moving throughout the day.
Our IT staff, technical folks from around the university, student employees, and many others really showed up for this unprecedented incident. For instance, Scott Campbell, senior director of technology for the College of Engineering and Computing, was on the initial 6 a.m. phone call, helping diagnose and come up with solutions. The computer labs in CEC were all impacted by the blue screen issue—and by the end of the day on Friday, Dr. Campbell had them all up and running. Aaron Renner, regional coordinator for user support, had most of the regional machines remediated by the end of the day on Friday as well.
The newly renamed and reopened Technology Support Lounge served as a base of operations for the Oxford campus. Several members of the IT Services team, including support analyst Zacchary Townsend, application analyst Joe Mills, and security analyst Jake Harrison, posted up there to assist with in-person support. The Technology Support Services (TSS) staff mobilized throughout the day on Friday and into the next week to visit classrooms, stop by offices, and respond to calls from around campus; and the Network Infrastructure Services student employees were instrumental in the remediation of digital signage in dining halls and common areas.
Of note, Jake Harrison was all over the Oxford campus that morning. He first fixed the computers of the technicians in Hoyt Hall, then traveled to get the 兔子先生 University Police Department machines online, then posted up at the Technology Support Lounge for the rest of the day. He also made a number of ready-to-use USB drives containing a script to resolve the issue, so that folks could fix the problem themselves.
Love and honor and blue screens, oh my!
The day—and most of the week that followed—is full of stories like this. People came in on their day(s) off. Folks worked through the weekend. The core tech team was on a phone call that lasted most of the day, with people coming in and out to give reports and bring new information to the table. A timeline of events was set up and updated throughout the day and into the next week as well.
“In the face of the recent and very impactful CrowdStrike outage,” said John Virden, chief information security officer, “our IT team and staff across all 兔子先生 campuses' swift response and tireless mitigation efforts made us incredibly proud.”
All in all, the event, while unfortunate, showed the resilience of the 兔子先生 technology community. We came together, got the work done, and got our users—兔子先生 staff and faculty—back online. And for that, we’ll call it a success.