Authors: Gene Kim, Kevin Behr, George Spafford
The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win. Gene Kim, Kevin Behr, George Spafford. 2013
DevOps novel (The Phoenix Project)
The Phoenix Project: This book is the story of a fictional enterprise, a kind of business novel. Through this example, the authors show why it is so important for IT specialists to work actively with employees in every other department of the company, and how to organize that cooperation properly. The authors, specialists and consultants in IT management, draw on the management methods of DevOps, Agile, Continuous Delivery, and the Toyota Production System. DevOps, the main method described in the book, calls for the closest possible collaboration between software developers and IT operations staff and for the integration of their work processes. Effective software development and operation, and indeed the work of IT as a whole, are impossible unless IT specialists cooperate with all other departments of the company. Without competent IT management, IT is buried under a pile of unrelated tasks arriving from every department, which leads to the accumulation of huge technical debt and eventually paralyzes the work completely.
The story begins one September morning when Bill Palmer, head of a small IT group at Parts Unlimited, is on his way to work as usual. His thoughts about everyday affairs are suddenly interrupted by a call from the personnel department asking him to come in urgently. As it turns out, Bill has been summoned by CEO Steve Masters, who has decided to put him in charge of the entire IT sector.
Parts Unlimited is a major manufacturer of auto parts and accessories with a long history dating back to the 1920s. Its market position has been unenviable lately. Competitors are outperforming Parts Unlimited in every area, sales are down, new technology is slow to roll out, and staff are being laid off. There is even talk of disbanding some units.
In an attempt to breathe new life into Parts Unlimited, the Phoenix project was launched a couple of years ago to create a common platform for all orders, whether placed in stores or over the Internet. But “Phoenix” is stalling too: it has long since blown through every deadline, and its cost has gone far beyond the budget. Phoenix’s chief champion is commercial vice president Sarah Moulton, one of Steve’s top aides, whom he has allowed to draw on all available resources.
Structure of Parts Unlimited
• Management Board (CEO, Board of Directors)
• Production (factories, assembly lines)
• Product distribution network (shops, warehouses, delivery)
• Business sector (accounting, finance, marketing, sales, analytics)
• IT sector (software development, IT maintenance, information security)
In the IT sector, IT Operations handles software and computer maintenance across all Parts Unlimited divisions. Such departments exist in most companies today; their employees are system administrators, user support staff, and operators. The IT Development department creates and tests new applications at the request of top management, then hands them over to IT Operations for deployment in the production environment. The Information Security department (InfoSec) monitors compliance with IT security regulations.
Ups and downs
The new position and new problems
At Parts Unlimited, Chief Information Officers (CIOs) were fired one after another. There was even a joke that being appointed CIO meant the end of a career. This time the company was left not only without a CIO but also without a head of IT Operations.
Bill Palmer understood how hard it is to work in IT Operations: even as the leader of a small team, he constantly had to listen to complaints about network failures and broken computers. He wanted to refuse the promotion, but Steve skillfully steered the conversation, recalling Bill’s past achievements and the army background they share, and left him no choice but to agree. A raise would not hurt Bill either: he and his wife have not yet paid off the loan on their house and are raising two sons.
After congratulating Bill on his promotion, Steve immediately throws him onto the front line: another “level one failure” has occurred. The payroll program has crashed, and salaries must be transferred by five o’clock this evening. If they are not, the company will violate labor laws, the unions will raise a fuss, inspections will begin, and the reputation of Parts Unlimited will be ruined completely.
First steps: chaos and confusion
Running to the finance department, Bill finds complete chaos: employees are manually recalculating pay statements and entering data into a hastily written correction program that the developers sent them, bypassing IT Operations. It turns out that all the hourly-work data has been reset to zero, and in some fields unreadable characters appear instead of employee data.
Bill tells management that he will do his best but advises them to prepare a plan B: paying salaries based on last month’s payroll. In that case, though, some people may be paid the wrong amount, new employees will be left without pay, and those who have quit will receive money that will later have to be recovered. In addition, a financial audit is imminent, and such tricks would land the chief accountants and CFO Dick Landry in serious trouble.
Bill returns to his building, one of the oldest and most neglected in the company, which says something about priorities. In the network operations center, he finds his former colleagues, now his subordinates: Wes, head of distributed operations, and Patty, head of service support. Together with other employees, they are loudly arguing about the failure of the storage area network (SAN). It turns out that last night the report of the payroll failure came in during a SAN upgrade. Wes’s lead engineer, the “irreplaceable” Brent, suspected that the SAN had corrupted the data and proposed rolling the upgrade back. After that, nothing worked at all.
After quickly informing Wes and Patty of his new status (Steve never bothered to announce the appointment officially), Bill gets to work. He has to call in the all-knowing Brent again, tearing him away from his work on Phoenix.
After a long and painful search, it turns out that one of the developers “quickly” installed an anonymizer application 1 at the request of the head of the information security department, John Pesch, and then went on vacation. The garbled characters appeared only in the fields that held the personal data of the company’s employees, which, according to John, was at risk.
Bill has a stern talk with John, whom he sees only as a nuisance. Everyone in IT Operations is used to thinking of the “security guys” as paranoid people who are obsessed with absurd rules and only make everyone do extra work. John insists that he acted for the good of the company, because a security audit is also coming up soon. If the auditors find “holes”, as has happened more than once, Parts Unlimited faces huge fines. And since IT Operations had not responded to his requests for months, he decided to install a third-party program for encrypting personal data, “absolutely reliable” according to its vendor. When Bill asks whether he tested the program, John replies that he did not, because he has no suitable test environment.
To top it all off, Bill’s work laptop freezes during the latest corporate system update, and Patty gives him an antediluvian monster with its battery held on by adhesive tape: a typical case of the cobbler’s children going barefoot. The backlog of equipment replacements is another longstanding problem in their department.
While Bill, Wes, and Patty discuss how to fix the failure faster, Brent hammers away at his keyboard and announces that he has fixed the program. The crisis seems to be resolved, but the payment deadline has passed, the accounting department has had to fall back on the flawed plan B, and the next day the newspapers write about yet another failure at Parts Unlimited.
Attempts to organize everything and new crises
Bill is very concerned that no one in the company controls the many changes made to the software in use. At a general meeting of the IT sector, he orders the creation of a system for tracking all changes. Patty says that she has been trying to introduce a system for tracking patches and edits for a couple of years, without success. People still work around it, either because they consider their changes minor or because employees in other departments urgently ask them to.
Looking more closely at this system, Bill realizes that it is also disastrously inconvenient. He hands out paper cards and asks those present to write down all upcoming changes with their proposed dates, and then to bring the cards to Patty, who, together with two employees, is to work out a new scheme. After a couple of days, it turns out that there are almost 450 such changes, and several tables are covered with cards.
In addition to running his department, Bill takes part in regular meetings about the Phoenix project. Commercial vice president Sarah is very unhappy that Bill keeps pulling a valuable employee, Brent, away from her project, and does not want to hear excuses. Head of Development Chris says that if everything goes well and the virtual environment being built with Brent’s help works, the project can be launched in a week. Hearing this, Bill is horrified. He and Wes protest that the developers always drag things out until the last minute, as if the entire project schedule existed only for them, while the system still has to be deployed by IT Operations afterwards. As always, there will be unforeseen complications, a shortage of servers, delays, interference, code rework, and conflicts between incompatible versions.
Bill tries to talk Steve out of launching Phoenix, but Steve does not want to hear about it either: he has already spoken to the board and even given media interviews about the upcoming changes to the company’s services. If the changes are delayed yet again, investors will not like it, and the position of Parts Unlimited will become truly dire.
Meanwhile, Nancy, the head of internal audit, informs Bill of an upcoming audit and of a list of problems, a hundred pages of small print, that needs to be fixed. This means assigning part of the staff to the audit, even though they are already up to their necks in work. Some of the accounting software servers are very old, and only “the most experienced engineer”, Brent, understands how they work. Brent again!
Bill takes the time to observe Brent’s workflow. It turns out that Brent is always busy and never has time for anything: he is shuttled from one task to another and is called in even on minor issues, although an ordinary employee could handle, for example, reinstalling an operating system. Much the same can be said of the entire IT Operations department. As Patty says, priority here goes to whoever shouts the loudest. Bill tries to think of some system for distributing work, but little comes to mind. He asks Patty to prepare a list of all the projects assigned to their key employees in order to calculate their workload and perhaps ask Steve for more staff. It turns out that there are almost a hundred and fifty ongoing external and internal projects, and together they would tie up resources for almost the whole of the next year.
New acquaintances and new solutions
Bill’s gloomy thoughts are interrupted by an invitation to meet a prospective board member, Eric Reid, who comes across as a slightly mysterious and strange man. Eric invites Bill to drive over to the production building, where they watch raw materials being turned into products and sent to the warehouses, that is, the so-called value stream. Judging by how much detail Eric goes into about the plant’s past production difficulties, he was once directly involved. He talks about the haphazard order picking of the past, the materials that piled up at bottlenecks and slowed the pace of the entire plant, and how managers dealt with such problems.
According to Eric, three methods have proven effective: establishing a fast flow of work, shortening and amplifying the feedback loop, and building a culture of constant experimentation and learning. At first, the comparison of a physical production process with intellectual work in IT surprises Bill, but he gradually begins to see the point. Indeed, software, server capacity, licenses, and orders can be seen as the raw materials here, and the final state of the system as the product. At the end of the conversation, Eric suggests that Bill think about the four types of work in his department and call him as soon as he has worked them out.
Bill and Patty continue to work on the change tracking system and on formalizing it. They sort the changes by severity, routineness, and necessity. Non-priority changes go to the end of the queue; for routine changes, they propose developing standard procedures so that next time they take less time. Change meetings are to be held regularly, and the changes are to be tracked on a kanban board 2 divided into the columns “Planned”, “In progress”, and “Done”, with each change scheduled for the most suitable time given the project workload. In addition, they propose holding regular drills on handling failures and on finding the changes that caused them.
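The three-column board described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `KanbanBoard`, its methods, and the sample change cards are hypothetical and do not come from the book.

```python
from dataclasses import dataclass, field

# Column names taken from the text; everything else is illustrative.
COLUMNS = ("Planned", "In progress", "Done")


@dataclass
class KanbanBoard:
    columns: dict = field(default_factory=lambda: {name: [] for name in COLUMNS})

    def add(self, card: str) -> None:
        """New change cards always start in the first column."""
        self.columns[COLUMNS[0]].append(card)

    def advance(self, card: str) -> str:
        """Move a card one column to the right and return its new column."""
        for i, name in enumerate(COLUMNS):
            if card in self.columns[name]:
                if i == len(COLUMNS) - 1:
                    raise ValueError(f"{card!r} is already done")
                self.columns[name].remove(card)
                next_name = COLUMNS[i + 1]
                self.columns[next_name].append(card)
                return next_name
        raise KeyError(card)


board = KanbanBoard()
board.add("Upgrade SAN firmware")    # hypothetical change cards
board.add("Patch payroll database")
board.advance("Upgrade SAN firmware")
print(board.columns)
```

Keeping the columns in a fixed tuple makes the left-to-right order explicit, so a card can only move forward one stage at a time.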
Bill orders the “irreplaceable” Brent (in fact, a bottleneck) to focus only on the Phoenix project and on the most urgent cases, without being distracted by anything else. Together with Patty and Wes, he thinks about how to replace Brent in everything else. To do this, they decide to watch how Brent “intuitively” untangles problems and to write down his actions as a procedure. A separate team is to deal with emergencies and call on Brent only in the most difficult cases, noting what he does and thereby learning. True, it turns out that some of the planned changes are being postponed because Brent is unavailable, and this threatens to create new problems.
The collapse of the Phoenix
On the appointed day, the Phoenix deployment failed miserably. As the new code was rolled out, the load on the servers increased dramatically and the deployment slowed to a crawl. Chris’s staff and Bill’s staff fought constantly over misunderstandings and the lack of clear documentation. As a result, the databases of online orders and in-store orders were inconsistent, and the POS system froze completely. Dissatisfied customers cancelled contracts, and sales staff resorted to emergency measures: they took down card details by hand for later processing and passed personal data through insecure channels. This created yet another threat to information security (although John unexpectedly helped Bill here by steering the auditors away from the room in which scans of cards with CVV2 codes were being hastily destroyed). Board members were furious, and investors were disappointed. The threat of breaking up the company and outsourcing IT arose again. Steve gave everyone three months to put things right.
First glimpses and again clouds
After this ordeal, Bill and Chris, the head of Development, decide to have a heart-to-heart talk in a bar. They share their thoughts and concerns and agree to work more closely together in the future.
Meanwhile, the change tracking system is bearing its first fruit: it makes it possible to spot duplicate changes made by different departments. But it also reveals the catastrophic impact of the emergency rollout of Phoenix: hundreds of planned changes have not been carried out. As a result, Bill finally identifies the four types of work in the IT Operations department:
• business projects (on behalf of other departments of the company);
• internal projects;
• changes (modifications to existing software and systems);
• unplanned work (failures, emergency work).
The goal is for the last type to interfere with the first three as little as possible.
During the next “level one outage” (a failure of the billing system), the IT team acts more confidently, communicating with other departments and working out which changes may have caused the outage. No one yells or blames others, and Bill even has time to spend the evening with his family. He also sends Brent home so that he does not make things worse with his “intuitive” actions, asking him to come back only once the source of the failure has been located. But then Steve calls Bill in the middle of the night and, raising his voice, orders him to return to work immediately and bring all the “lazy” employees with him. He does not want to hear any explanation that the work is proceeding as it should. Bill loses his temper too, announces that he is quitting, and hangs up.
After a few days with his wife and kids, Bill agrees to talk to Steve, who apologizes and asks Bill to stay on for at least another three months in “an atmosphere of trust, openness, and cooperation.” Steve has realized that IT is not just a department but the core of every effort the company makes, critical to all daily operations. Bill agrees and takes part in an informal meeting of the heads of all key departments, where, following Steve’s example, they all share their thoughts, experiences, and dreams.
Method one. Process Debugging
Turning to the technical side, Bill explains that IT Operations takes on every project, no matter how important or feasible it is. As a result, “technical debt” builds up, and all the department’s efforts go only into paying “interest in the form of unplanned work.” To deal with the pile of current tasks, it was proposed, for a start, to freeze all IT projects except Phoenix and a few internal ones. This reduced the workload, but Bill feared that it would grow again once the freeze was lifted, so a process for categorizing projects had to be developed. By analogy with industrial production, it was decided to distribute work among “work centers” of several people each. The number of business projects in progress (WIP, work in progress) should not exceed 4-5 at a time, and 20% of total capacity should go to internal work (infrastructure replacement, database maintenance, security, and so on). The remaining projects stay in a kind of “incoming warehouse”, waiting their turn; some of them, such as upgrading databases that are due to be decommissioned within a year anyway, are expected to be cancelled altogether. The department must learn to refuse projects that are obviously unfeasible or optional. It is important to understand that the department’s resources and man-hours are limited: it can pass only a certain amount of WIP through itself per unit of time, and no amount of multitasking will change this. The main conclusion was that any improvement made anywhere other than at the bottleneck (which here was both the IT Operations department as a whole and Brent in particular) is an illusion.
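The WIP limit and the 20% reserve for internal work can be illustrated with a short Python sketch. It is a simplification under assumptions: the function names, the project names, and the figure of 400 staff-hours are hypothetical; only the limit of five concurrent projects and the 20% share come from the text.

```python
from collections import deque

WIP_LIMIT = 5          # upper end of the "4-5" range from the text
INTERNAL_SHARE = 0.20  # share of capacity reserved for internal work


def split_capacity(total_hours):
    """Return (business_hours, internal_hours) for a given capacity."""
    internal = total_hours * INTERNAL_SHARE
    return total_hours - internal, internal


def admit(backlog, in_progress, limit=WIP_LIMIT):
    """Move projects from the backlog into progress until the WIP limit is hit."""
    started = []
    while backlog and len(in_progress) < limit:
        project = backlog.popleft()
        in_progress.append(project)
        started.append(project)
    return started


backlog = deque(f"project-{n}" for n in range(1, 9))  # hypothetical project names
in_progress = ["Phoenix"]

started = admit(backlog, in_progress)
business, internal = split_capacity(400)  # 400 staff-hours is an assumed figure
print(started, len(backlog), business, internal)
```

With one project already in progress, only four more are admitted; the other four stay in the backlog, which plays the role of the “incoming warehouse”, until something finishes.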
The notion of being busy was also redefined after a “simple” job that Brent expected to finish in half an hour stretched out over several days, because it required coordination with other employees who were tied up in other projects. This led to another important conclusion: for the system to work faster, employees must have a certain amount of idle time to absorb related and unplanned work; otherwise, that work simply piles up in the “incoming warehouse”. The authors argue that if employee utilization consistently exceeds 80%, wait times increase dramatically.
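The effect of utilization on waiting can be made concrete with the rule of thumb usually quoted from the novel: wait time is proportional to the share of time a resource is busy divided by the share of time it is idle. The summary above does not state this formula, so treat it as an assumption; the function name `relative_wait` is also hypothetical.

```python
def relative_wait(utilization: float) -> float:
    """Relative wait time as busy share divided by idle share (assumed rule of thumb)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return utilization / (1 - utilization)


for pct in (50, 80, 90, 95, 99):
    print(f"{pct}% busy -> {relative_wait(pct / 100):.1f} units of wait")
```

Under this assumption, a resource that is 50% busy costs one unit of waiting, 80% costs about four, 90% about nine, and 99% about ninety-nine, which is why a little slack shortens queues so sharply.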
Of course, all projects were tracked on the kanban board, and many processes were standardized so that they, and similar tasks in other projects, could be carried out faster next time. By eliminating multitasking and focusing on a small number of truly important projects, the department was able to accomplish more in a week than it normally would in a month. And thanks to the improved monitoring and control system, potential security breaches could now be detected, so the new protocols that John (InfoSec) insisted on turned out to be less urgent. At first, John was offended at being treated as a superfluous link, but afterwards he began to cooperate actively with Bill.
But this was only the first of the three methods proposed by Eric Reid. It is not enough simply to standardize and optimize the handling of current work, the value stream. It was also necessary to make sure that nothing moved against the flow (for example, code sent back for rework, or unplanned work appearing) and that the overall situation of the company was taken into account. The head of a department must think like the chief executive of the company, focus on common goals, and cooperate with other departments while maintaining an atmosphere of experimentation and learning.
Method two. Feedback amplification
Together with John, Bill met with the heads of the main departments, asked about their work, and looked at the company through their eyes. A conversation with CFO Dick Landry, who also informally served as chief operating officer (COO), was especially helpful. Dick said that although he cares about financial performance, market share, and profitability, these are not the most important things. What matters most is understanding customer needs, having the right product portfolio, speed to market, competitiveness, and accurate sales forecasts, that is, a broader perspective. What is the use of great financials in the wrong market with the wrong strategy and the wrong research team? Armed with this knowledge, Bill tried to view the work of his IT department from this broad perspective and saw how all its services related to it. From his conversations with the department heads, he understood better what they needed from IT (for example, a more efficient system for tracking goods and managing warehouse loads). Having shared his own problems, Bill secured their support for allocating additional funds to IT. At the same time, he realized that the ambitious “Phoenix”, which had been in development for three years, would not solve any of these problems. What was needed were actions that would bring at least a small return much faster and allow the company to respond to market demands in a timely manner.
Method three. Experiments
And then an opportunity to start such a project turned up. Sarah once again decided to go around IT Development and IT Operations, whose projects were frozen, and push through her own way of collecting user data for Phoenix by outsourcing it. The result was an outage, which this time, thanks to regular drills and the improved monitoring system, was resolved promptly. At the meeting, Bill suggested creating a separate team to build the marketing campaigns for Thanksgiving and Black Friday. The team included representatives of IT Development, IT Operations, and the business departments; the project was named “Unicorn” and was built around a so-called deployment pipeline running from code development to release.
The project participants worked out what each department needed in order to do its job, identified the points where their work intersected (environment creation, automated tests, running the code, moving it to another environment, quality control), and developed standard procedures for them. This reduced the reliance on supporting documentation, and developers were able to create packages that deployed code automatically to test and production environments without manual checks by testers and IT Operations. By moving to cloud services and getting rid of unnecessary routine, IT Operations was able to make its servers available to developers so that they could build a more realistic environment. And to protect user data from the very start, they also brought in people from John’s department. Everyone agreed to meet regularly at workshops and coordinate their activities.
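A deployment pipeline of this kind can be pictured as an ordered list of automated stages in which a failure at any stage stops everything after it. The Python sketch below is a toy model under assumptions: the stage names loosely follow the intersection points listed above, the function `run_pipeline` is hypothetical, and each lambda merely stands in for a real script.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order; stop at the first failure and return the ones that passed."""
    passed = []
    for name, step in stages:
        if not step():
            print(f"pipeline stopped at: {name}")
            break
        passed.append(name)
    return passed


# Placeholder steps: each lambda stands in for a real script and just reports success.
stages = [
    ("create environment", lambda: True),
    ("run automated tests", lambda: True),
    ("deploy to test environment", lambda: True),
    ("quality control", lambda: True),
    ("deploy to production", lambda: True),
]

result = run_pipeline(stages)
print(result)
```

Because a failed stage halts the run, code that breaks the automated tests never reaches production, which is what removes the need for manual checks at each hand-off.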
Despite difficulties in developing the applications for online recommendations, online ordering, product information storage, and delivery, the Unicorn project allowed the company to finish a quarter in the black for the first time in a year and a half. The inspired team members set themselves the goal of reaching several dozen small releases a day, as leading high-tech companies do, although previously even three-week release cycles had seemed a pipe dream to them. Having automated the routine and thus freed up time, they began from time to time to deliberately run disruptive programs that inject faults into a simulated working environment in order to expose potential failures, and to hold regular drills on fixing them.
Eric Reid admits to Bill that he had been deliberately watching him and steering him onto the right path, because he supports innovation in IT management and advocates new methods such as DevOps, Agile, Continuous Deployment, and adaptations of the Toyota Production System. He has abandoned the idea of joining the board and decided instead to create a hedge fund to support the best minds in IT.
At the New Year’s Eve party, Bill was presented with his old monster of a laptop, now painted bronze, and Steve offered him a two-year trial period heading various departments, with the prospect of becoming CEO of Parts Unlimited in three years. In Steve’s opinion, the position of IT director will not be enough for Bill. IT is the lifeblood and heart of any company, and a person who does not understand it cannot become a good leader. Bill has been through this school and is ready to move on.
Top 10 Thoughts
1. Information technology plays a leading role in the structure of a modern business. Active cooperation of IT specialists with each other and with other departments makes it possible to respond quickly to events and to control production and management processes with confidence.
2. The workflow in an IT department resembles the production process in a factory, except that the raw materials here are program code, operating systems, licenses, server and computing capacity, and professional skills, and the products are working software and the completed tasks of other departments.
3. Work in the IT department is divided into four types: business projects (expressing the needs of other departments and management), internal projects (maintenance of servers and infrastructure), changes (editing existing software as needed), and unplanned work (eliminating errors, failures, faults).
4. The throughput of the IT department is limited. By taking on too much work, it creates “technical debt” that comes due as unplanned work.
5. The first method: optimize the workflow through a change tracking system, job automation, and workflow monitoring (kanban boards), and by identifying and eliminating the bottleneck, the narrowest point in the flow.
6. “Work centers”, that is, groups of employees, need some slack time, because other “work centers” and employees may need their help to complete their own tasks; without slack, such requests queue up and “technical debt” accumulates.
7. The second method: shortening and amplifying feedback. This includes freezing the intake of new tasks when the bottleneck and “technical debt” have to be dealt with; creating automated testing and deployment systems; integrating the work of the development and operations departments; and understanding the overall goals of the company.
8. To respond to the market situation in a timely manner, new product release cycles must be shortened. It is better to divide the work into smaller parts that can be processed and implemented faster.
9. The third method: maintaining a culture that encourages experimentation and learning; creating an atmosphere of trust and openness; the ability to take conscious risks.
10. The third method also involves regular drills on recovering from possible failures and the deliberate injection of faults to test the stability of systems.
1. In this case, an application that masks personal data.
2. A kanban board is a task-planning tool used in the kanban system. Cards describing the tasks are attached to the board and, as each stage of a task is completed, they are moved from one column to the next.