Killing the Fail Whale With Twitter’s Christopher Fry
Christopher Fry. Photo by Josh Valcarcel/WIRED
Christopher Fry is Twitter’s 43-year-old senior vice president of engineering. He runs everything engineering-related at the company. This means he’s the guy whose job it is to make sure Twitter can handle the massive volumes of tweets that flow across its servers every time, say, Miley Cyrus learns a new dance move at a strip club. He’s a big dude, a surfer and sailor, who came to the company from Salesforce. He also did a post-doc in computational neuroscience at Berkeley, where he studied the auditory cortex of zebra finches. WIRED sat down with Fry to talk about how Twitter will continue to grow and what keeps him up at night, and to find out whatever happened to the Fail Whale.
WIRED: Is there anything about the language of song birds that you can apply to engineering at Twitter?
Fry: The interesting thing about bird songs is they’re learned. They’re this example of this complex learned behavior that’s passed down. Actually, a lot of the original work was done here at Berkeley. They studied basically the dialects of birds in the Bay Area. So there are whole maps of white-crowned sparrows and how their language changes across the geography of the Bay Area.
Once I left academics, I started doing startups and started moving into the technology world. But one of the things I bring to every job is this love of learning. One of the things we did this year was found Twitter University, which is really about creating this locus of learning inside the organization and building a learning organization. We acquired Marakana and got two really great founders to come in and basically build world-class technical training inside Twitter, provided for free. Every engineer could become an expert in Android or iOS. We have all kinds of different programming languages. It’s really been this incredibly fun thing to create. We want Twitter to be able to do whatever we need it to do within three months, the whole organization. The university gives us that ability to adapt and learn.
WIRED: I’d assume you’d want engineers to have ownership of specific projects. Does that mean that, like, you would want your people who are on iOS to know also about Android as well, just to know it?
Fry: You know, it’s generally good if, one, people appreciate what everybody else is doing and, two, have general knowledge and can work around the systems. So, just like any system, if you have too much specialization you get brittle and you can’t change quickly. In a perfect world, everybody would be able to do everything. You obviously have specialists, and specialists are important. But to the extent that our engineers can have a high degree of aptitude in any discipline, it’s good for us. Good for the teams and good for what we need to do.
WIRED: So, do you have people who are working on multiple projects at once?
Fry: We do. It’s interesting. When we were looking at scaling out mobile, we wanted to make sure that we moved away from this one team inside Twitter building mobile products to scaling out mobile across engineering. So, what we did there was train up a bunch of people to work in Android and iOS, and then we took the mobile team and we left sort of a core team intact but put the mobile engineers out onto the different product teams so that we built a mobile capacity across all of engineering. Twitter has a long history of being mobile-first, but we wanted to extend that even more. We make sure every place we’re building a product, we’re building it onto mobile devices. So, part of what we did was, one, bring up experts in whatever it was and then, two, distribute the teams but still keep core teams that focus on the core mobile infrastructure in place. So, that’s the best long answer to your question.
WIRED: We’re hitting the point where more than half the world has a smartphone. People are coming online, many for the first time, in countries where they’re buying things like twenty-five dollar Android handsets. What type of engineering challenges does that pose?
Fry: There’s two or three things that you have to think about. One is, people are used to working on the web where you can know everything that’s happening in real time. One of the strategies you have to take (we’ve taken this and are pretty prepared for it) is building all the infrastructure that you have on the web into your mobile frameworks. This gives you the ability to experiment, the ability to try things out, the ability to iterate quickly. People sometimes think about mobile products as these shipped, static products and web products as very dynamic and pliable. You have to create the infrastructure to have a dynamic and pliable infrastructure in mobile. On the web, you can track every click. To build great products you have to have that insight into mobile.
Generally, not everybody around the world has the latest iPhone or Android device. So you have to basically tailor your product to run well in places where there are lower-end devices, and maybe not as good networks, or even very unreliable networks.
WIRED: Do you engineer for the lowest common denominator?
Fry: You don’t engineer for the lowest common denominator, but you do tailor the product that you deliver to the market you’re going into. So you’ll have a team that’s focused on creating the Twitter experience for that market.
WIRED: I want to talk about scaling and stability. I read that you said Twitter was trying to solve its problems by throwing machines at them rather than approaching them from an engineering standpoint. Is that…
Fry: Did I say that? I don’t think I said that.
WIRED: I believe you did? [Ed note: He didn’t say that! It was Raffi Krikorian, in a blog post here.]
Fry: Twitter definitely has had scaling issues in the past, and one of the opportunities I saw coming into Twitter was both scaling out the infrastructure and scaling out the organization at the same time. Having gone through that at Salesforce, I was able to bring that learning with me. When I think about the infrastructure problems we had, there was a key problem that we had to solve which was decomposing our monolithic code base. We had a monolithic Ruby server and we were able to basically decompose that into a set of services. Then applying Mesos as that layer of indirection gives us a way to pack services onto machines to get higher utilization. We can get reliability and efficiency at the same time on top of faster developer productivity as well.
WIRED: Tell me what Mesos is if you don’t mind.
Fry: Mesos is our version of elastic compute. It sits between the hardware operating system and what developers deploy, so it gives you a scalable way to deploy services to a set of boxes. It becomes like the operating system for a data center, if you will.
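[Ed note: To make the idea of packing services onto machines concrete, here is a minimal sketch. It is not Mesos's scheduler or its API. It is a plain first-fit-decreasing bin-packing heuristic, swapped in purely for illustration, and the service names, CPU figures, and the Machine, pack, and utilization helpers are all hypothetical.]

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """A hypothetical box with a fixed CPU capacity (in cores)."""
    name: str
    capacity: int
    used: int = 0
    services: list = field(default_factory=list)

    def fits(self, cpus: int) -> bool:
        return self.used + cpus <= self.capacity

    def place(self, service: str, cpus: int) -> None:
        self.services.append(service)
        self.used += cpus


def pack(services: dict, capacity: int) -> list:
    """First-fit decreasing: take the largest services first and put each
    one on the first machine that still has room, adding a machine if none does."""
    machines = []
    for service, cpus in sorted(services.items(), key=lambda kv: kv[1], reverse=True):
        if cpus > capacity:
            raise ValueError(f"{service} needs more CPU than one machine offers")
        target = next((m for m in machines if m.fits(cpus)), None)
        if target is None:
            target = Machine(name=f"machine-{len(machines) + 1}", capacity=capacity)
            machines.append(target)
        target.place(service, cpus)
    return machines


def utilization(machines: list) -> float:
    """Fraction of the total CPU capacity that is actually in use."""
    total = sum(m.capacity for m in machines)
    return sum(m.used for m in machines) / total if total else 0.0


# Hypothetical services and the CPU cores each one needs.
demo = {"timeline": 6, "search": 5, "tweets": 4, "users": 3, "ads": 2}
packed = pack(demo, capacity=10)

for machine in packed:
    print(machine.name, machine.services, f"{machine.used}/{machine.capacity} cores")
print(f"utilization: {utilization(packed):.0%}")
```

[Ed note: In this toy example, five services that would otherwise each occupy their own 10-core machine (40 percent utilization) fit onto two fully used machines. A real cluster manager weighs far more than CPU, including memory, failures, and isolation.]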
WIRED: Other people are using it as well, right?
Fry: Yeah, it’s used outside Twitter. I think it’s used a bunch of places. It’s an open source project…
WIRED: You smiled when you said that. Are you proud that it’s used…
Fry: I am, I am, I am. I think it’s currently used at Airbnb, and I was trying to come up with a list of other ones but I just don’t have a quick list. But it’s used in a bunch of places and it’s a very successful Apache project. Twitter has a long history of giving back to open source, and Mesos is one of our probably biggest open source successes right now, I would say.
Part of the Twitter service itself is the free flow of information, and so I think a lot of people that come to work here have a passion around that. Generally, inside Twitter engineering we prefer things to be open rather than closed, so where we can share we do. So yeah, it ties into the culture of Twitter itself and the product and how we build it.
There are some great benefits to open source. One is obviously you end up building quality into the product because it’s very transparent, everybody sees what’s happening. And then you get contributions back into the project, so then you can create a platform on which people can build new things and you can bring them back into the company.
WIRED: So is the Fail Whale a thing of the past now?
Fry: The Fail Whale is a thing of the past. Actually, this summer we took the Fail Whale out of production. So if you come to Twitter, and there are always gonna be problems, no service is ever perfect. But right now you will see robots instead of the Fail Whale. So the Fail Whale image is not served by Twitter anymore. It had a long history and some of our users feel very connected to it. But in the end, it did represent a time when I don’t think we lived up to what the world needed Twitter to be.
We are a service that people turn to in moments of joy, and also when things are going horribly wrong in the world. So I feel a personal commitment, as I think does everybody that works here, to having a service that’s available when anyone needs it. And sometimes Twitter may be the only thing that’s working during a flood or during a major disaster. So we’re very committed to being the most reliable service that we can be.
WIRED: Do you view Twitter as a key piece of communication infrastructure?
Fry: I do. When we think of the purpose of Twitter, what we’re able to do, making it so any person in the world can communicate with any other person, connecting all the people on the planet, that is an incredible mission to be on. We’re probably still early in that mission, but that is the goal: that any one person can communicate with every other person in the world.
WIRED: If you say you deleted Fail Whale, then people can’t get on Twitter, it seems like that’s really opening yourself up to criticism.
Fry: We even debated internally whether we would talk about that outside of the company because we’re still going to have issues here. We have had a long period of much more reliable service which gave us the confidence to say, we really feel we’ve made a substantive difference versus just a small change in how the service is operating. There will always be issues with Twitter. When I think about things that keep me up at night, one is the reliability of the service. The other is: are our engineers as efficient as they can be? Do we have all the infrastructure to make sure they can rapidly deliver code so that we can iterate on their product quickly? I think we still can. I think there’s a world of innovation that is ahead of us with Twitter; we’ve only scratched the surface and there’s way more to come. Even though we’ve accomplished a lot, I think there’s still a lot to do.
If you’re always fighting reliability fires, you’re not innovating a product. So you have to have that core infrastructure layer in place so you can then make it more efficient and iterate upon it and build great consumer experiences. I think getting reliability in place is the first step towards really doing product innovation. Sometimes, you will feel like they’re in conflict. I don’t feel that way. I don’t.
WIRED: Is that why there have been so many new products coming out recently?
Fry: I do feel like going through the steps of creating a reliable service, getting to scale, making it efficient and then creating this mobile infrastructure where we can rapidly iterate has meant that we’ve been able to do things like MagicRecs and Event Parrot. Those are two of the things that I think really represent a special experience of Twitter because they’re in the moment.
So if you take Event Parrot… it’s sometimes hard to explain what Twitter is, but when Event Parrot’s on your phone, you become the first person in the world, maybe in your network, to know about something that’s happening. So it really brings the news quickly to you and what’s happening in the world. It makes Twitter very accessible. So I think this story of going from reliability to product innovation has let us experiment with things like that.
WIRED: What advice would you give to those tasked with fixing Healthcare.gov to make it more stable and scalable? Are there general principles or practices they should follow to fix a massive product that can’t go down while it’s being fixed?
Fry: I would give the same advice to almost any software organization: stay close to the people who are going to use your product, don’t spend a lot of time writing specifications, try to iterate quickly and get to a v.1 as soon as possible. You’ll want to get your software in the hands of people that will use it. It’s important to get a steel thread of functionality working end-to-end rather than building it out in layers, so work through a single use case that has you build some UI, logic and backend. Almost all software organizations end up fixing the plane while it’s flying.