GitHub’s Chief Security Officer and Senior Vice President of Engineering, Mike Hanley, shared more details today about a series of outages that hit the code hosting platform last week.

Although these incidents had unrelated root causes, they affected most of GitHub’s core services from May 9-11, causing widespread database connection and authentication failures for up to ten hours.

“Over the past week, GitHub has experienced several availability incidents, both long-lived and short-lived. We have since mitigated those incidents and all systems are now operating normally,” Hanley said.

“The root causes of these incidents were unrelated, but overall they negatively impacted the services that organizations and developers trust GitHub to provide. This is not acceptable nor the norm which we stand by.”

On May 9, eight core services were affected by a major outage caused by a configuration change to GitHub’s internal service that serves Git data.

The second outage, which occurred on May 10, disrupted the issuance of authentication tokens for GitHub Apps. It resulted from high load combined with an inefficient implementation of an API responsible for managing GitHub App permissions.

“On May 10, the DB cluster serving GitHub app auth tokens saw a 7x increase in write latency for GitHub app permissions (yellow status),” Hanley explained.

“The failure rate for these authentication token requests was 8-15% for the majority of this incident, but peaked at 76% for a short time.”

The third outage, on May 11, was caused by a loss of read replicas after a database cluster serving Git data crashed and triggered an automated failover.

Figure: GitHub incident history (GitHub)

“We are addressing the Git database crash which has caused more than one incident at this point. This work was already underway and we will continue to prioritize it,” Hanley said.

“We address database failover issues to ensure that failovers always fully recover without intervention.”

GitHub will share more detailed information about these outages and what it is doing to resolve the issues that caused them in its May Availability report.

“The May report will include these incidents and any additional details we have about them, as well as a general update on progress being made toward increasing GitHub’s availability,” Hanley said.

GitHub also suffered several outages within a single week in March 2022. The company later disclosed that those incidents were caused by resource contention in the platform’s main database cluster.

Another major outage hit GitHub in February 2022, taking the platform down worldwide, preventing access to the website, and blocking commits, clones, and pull requests.

