On-call Is 50% of Software Engineering

I am not going to lie; software engineers have a pretty good life. We can come into the office around 10 AM and still leave around 5 PM. We rarely need to work on weekends. We can work from home if we need to. We get unlimited snacks and free food for lunch, and taking a personal vacation is encouraged. Most of the time, it is a comfortable life, but for me, work can get pretty stressful and chaotic.

People rarely associate on-call with software engineers. For most engineers in tech, on-call is a pretty light load. It can even be non-existent, especially for front-end engineers. For backend engineers, on-call can get heavy. People can constantly ping you on Slack about issues they are facing. You can get woken up at 3 AM on a weekend to solve a live technical issue. Losing sleep and having a noisy workweek can be a common theme for on-call.

My Team

I work on a backend team that focuses on distributed messaging systems, and these systems serve as the foundation of the company’s product. If our services become unresponsive, the product will be non-functional and consequently will lose revenue. Because our services are Tier 1, on-call is required to maintain the health of our systems.

Hundreds of applications rely on our messaging systems, and because of the high usage volume, things break unexpectedly. My team has one of the highest issue counts in the company. In a single week, we can receive more than 150 issues.

In a typical year, I am on-call about 12 times, and each on-call shift lasts 3 to 4 days. For those 3 or 4 days, I have to be available to fix issues, even if it is 3 AM on a Saturday. Being on-call means being handcuffed to my phone and laptop at all times. A critical incident is bound to happen, and I never know when it will occur. This is what makes on-call scary!

Critical Issues During On-call

During on-call, I encounter many types of issues, but the most common are:

a machine’s unavailability
lag in reading data
storage space filling up
latency in publishing data

If a machine is unavailable, applications will not be able to read data from it and process the data for downstream services. This incident could affect the flow of the app.

Lag in consuming data impacts users, who may experience it when loading content. Imagine consistently waiting more than 5 minutes for pictures to show up. This user experience could trigger a drop in user growth and, as a result, in revenue.

To solve these issues, I have to open countless tabs in Google Chrome and analyze data from multiple dashboards so that I can get a complete understanding of what is going on. For instance, one graph that I like to look at is incoming traffic over time. If I see a sudden spike in traffic and a sudden drop in the success rate of our services at the same time, I reach out to the team that owns that data and ask them to lower their traffic.
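
The check I do by eye on those two graphs can be sketched in code. This is only an illustration, not our actual tooling: the `Sample` type, the `traffic_overload_suspected` function, and both thresholds (a spike meaning twice the recent average, a drop meaning a success rate under 99%) are assumptions made up for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One dashboard data point (hypothetical shape)."""
    requests_per_sec: float
    success_rate: float  # fraction between 0.0 and 1.0


def traffic_overload_suspected(
    history: list[Sample],
    current: Sample,
    spike_factor: float = 2.0,       # assumed: spike = 2x the recent average
    min_success_rate: float = 0.99,  # assumed: below this counts as a drop
) -> bool:
    """Return True when traffic spikes and success rate drops together."""
    if not history:
        return False  # no baseline to compare against
    baseline = mean(s.requests_per_sec for s in history)
    spiked = current.requests_per_sec > spike_factor * baseline
    dropped = current.success_rate < min_success_rate
    return spiked and dropped


history = [Sample(1000, 0.999), Sample(1100, 0.998), Sample(900, 0.999)]
print(traffic_overload_suspected(history, Sample(2500, 0.93)))   # True
print(traffic_overload_suspected(history, Sample(2500, 0.999)))  # False
```

Requiring both signals at once is the point: a spike that the service absorbs without errors is not a reason to ask another team to back off.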

Without these dashboards, there is no way to determine how to solve an issue. Having a well-maintained monitoring system allows your services to be reliable and stable. If your services cannot guarantee these characteristics, teams will not leverage your services.

Causes for Pages

Pages fire for many reasons. Some are:

bad deployments that ship a bug in the release
scheduled outages in a particular datacenter
unexpectedly increased customer traffic
operational errors
noisy alarms

Some issues arise from a lack of communication. For instance, if a team increases their traffic, they should tell us beforehand so that we can make any modifications to capacity, if needed. If they do not, an increased read or consumer lag can occur, and memory space may fill up quickly.

An operational error can stem from a lack of documentation on the runbook. Runbooks describe what commands to run in order to perform a specific task, for instance draining or restarting a machine. However, not all runbooks are perfect. They may contain outdated information. Certain information may even be nonexistent. If this happens, an engineer may run a wrong command without realizing it.

It is important to update the runbook accordingly whenever a new type of operation or new tooling is introduced. Updating runbooks should be a proactive effort, not a reactive effort.

Steps Toward a Sustainable On-call

My team has recently focused on making our alarms less noisy. What this entails is:

adjusting alarm thresholds to be precise
determining which metrics need alerts and which ones we can ignore
determining whether a metric that needs an alert is a high-criticality or a low-criticality issue
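
The three steps above can be sketched as a small rule table. This is a rough sketch under assumptions, not my team's real configuration: the metric names, thresholds, and the `AlertRule`, `Severity`, and `evaluate` names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    HIGH = "page the on-call now"
    LOW = "file a ticket for working hours"


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    severity: Severity


# Hypothetical rules. A metric with no rule is deliberately ignored.
RULES = {
    "consumer_lag_seconds": AlertRule("consumer_lag_seconds", 300.0, Severity.HIGH),
    "disk_used_fraction": AlertRule("disk_used_fraction", 0.85, Severity.LOW),
}


def evaluate(metric: str, value: float) -> Optional[Severity]:
    """Return the severity to raise, or None if no alert should fire."""
    rule = RULES.get(metric)
    if rule is None or value <= rule.threshold:
        return None
    return rule.severity


print(evaluate("consumer_lag_seconds", 420.0))  # Severity.HIGH
print(evaluate("disk_used_fraction", 0.60))     # None
print(evaluate("gc_pause_ms", 9000.0))          # None (no rule, so ignored)
```

Keeping the threshold and the severity in one place makes each of the three decisions explicit and easy to review when an alarm turns out to be noisy.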

Another effort to make on-call more bearable is to snooze alarms when an operation or deployment is happening and to communicate that to the team. As a result, this engineering practice forces the engineer that is handling the operation to be responsible for that service. If an issue does arise during that operation, it is the responsibility of that engineer. The burden is lifted off the on-call.
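
One way to picture this handoff is a snooze window that reroutes alerts to the engineer running the operation. The sketch below is an assumption about how such routing could work, not a description of our alerting system; `Snooze`, `route_alert`, and the names `alice` and `bob` are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snooze:
    service: str
    owner: str       # engineer running the operation or deployment
    start: datetime
    end: datetime


def route_alert(service: str, fired_at: datetime,
                snoozes: list[Snooze], on_call: str) -> str:
    """Return who should handle an alert for `service` fired at `fired_at`."""
    for s in snoozes:
        if s.service == service and s.start <= fired_at < s.end:
            return s.owner  # inside the window: the operator owns it
    return on_call          # otherwise the regular on-call is paged


start = datetime(2024, 1, 6, 3, 0)
snoozes = [Snooze("broker", "alice", start, start + timedelta(hours=1))]
print(route_alert("broker", start + timedelta(minutes=30), snoozes, "bob"))  # alice
print(route_alert("broker", start + timedelta(hours=2), snoozes, "bob"))     # bob
```

The window has an explicit end, so a forgotten snooze cannot silence the service indefinitely.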

Benefits of On-call

Although on-call is not fun, there is a lot to learn from it. Being on-call forces you to interact with services that you are unfamiliar with. There were some instances where I had to delve into a service’s codebase. I am still not an expert on that particular service, but I can now participate in discussions about it. I no longer have to be confused about what other engineers are talking about!

Another benefit is that you become a faster problem solver. Because issues are time-sensitive, you have to resolve them quickly, or else they get escalated to the higher-ups. You learn how to debug an application more quickly. You learn how to read a stack trace properly. These are essential skills for becoming a proficient engineer.

Final Thoughts

Software engineering is not just working on code to push out new features. For full-stack and back-end engineers, it also involves monitoring the health of your services and taking steps to maintain that health, which is no easy task.

So the next time someone says that software engineers have it easy, ask them to step into your shoes and do on-call. See how long

Yen is an engineer at Twitter who has also worked with numerous clients, across more than 350 client sessions, on ramping up their technical skills for jobs.

Translated from: https://towardsdatascience.com/oncall-is-50-of-software-engineering-ca0e79f2dd97