TL;DR – This conference is really great as it’s put together by a group of people who aren’t tied to any company or beholden to any set of ideals; they just love monitoring 🙂 . There was a lot of open discourse, not only about monitoring in the technical sense but also about the human side of monitoring, which isn’t something that gets a lot of press.
The main thing I learned: We need to shift our focus about what we monitor.
The conference was held in the Compagnietheater, which was a beautiful venue. There were only 400 people in attendance, but from the many discussions I had, they were the upper echelon of the monitoring community. I met and talked with quite a few people, including some of the top people who run monitoring operations for Apple, Yahoo, the BBC and Standard Chartered Bank. Through these discussions, I learned a lot about what companies are looking for and what the future of monitoring will likely look like. These are the people investing time and money into the open source projects that are used by millions, like Graphite, Grafana, Prometheus and Icinga, and they are shaping the future of monitoring through those initiatives.
Below are some of the recurring themes and the biggest takeaways I got from the conference.
“Infrastructure monitoring doesn’t matter anymore”.
In almost every talk, someone said “Infrastructure monitoring doesn’t matter anymore”. What’s meant by this is that yes, it’s still being monitored, but now that most infrastructure is ephemeral, it just doesn’t matter the way it used to.
Any admin or operations person who knows what they are doing will have built everything knowing it will fail, and if you’ve built it correctly, you shouldn’t really have to care when it does.
You shouldn’t alert on disk space or RAM being full; you should alert when your application / service / product isn’t usable anymore. That’s really all that matters. When your business can no longer operate, that’s when you wake people up – not when one container dies, or even when 30% of your infrastructure fails.
(slide credit: https://www.slideshare.net/DominicWellington/how-ai-helps-observe-decentralised-systems)
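The idea of alerting on user-facing symptoms instead of infrastructure counters can be sketched in a few lines of Python. To be clear, the threshold, the input shape and the function names below are my own illustrative assumptions, not anything shown at the conference:

```python
def service_success_rate(per_instance_counts):
    """per_instance_counts: list of (successes, total_requests) tuples,
    one per container / VM. Dead instances simply contribute nothing."""
    successes = sum(s for s, _ in per_instance_counts)
    total = sum(t for _, t in per_instance_counts)
    return successes / total if total else 1.0

def should_page(per_instance_counts, min_success_rate=0.99):
    # Wake someone up only when users are actually affected,
    # not when an individual container or host dies.
    return service_success_rate(per_instance_counts) < min_success_rate

# 3 of 10 instances are dead, but the survivors answer fine: no page.
healthy = [(1000, 1000)] * 7 + [(0, 0)] * 3
# Every instance is "up", yet requests are failing: page now.
degraded = [(900, 1000)] * 10
```

The point of the sketch is that the paging decision only ever looks at the service-level number; how many boxes died to produce it is irrelevant.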
Monitor for unknown unknowns
In conjunction with the aforementioned ideology, most presenters were also saying that what matters in modern IT with complex distributed systems is monitoring for your unknown unknowns.
When you set up your monitoring initially, you set it up to monitor your known knowns, i.e. what you think / know will fail.
While your infrastructure is running, things fail that weren’t being monitored, and so you start monitoring your unknown knowns.
We always try to monitor our overarching infrastructure in order to be aware of our known unknowns and begin to understand them.
What really bites us in the ass, and can ruin a business, are the unknown unknowns: the things we aren’t aware of, don’t understand and don’t monitor.
Currently, almost everyone is monitoring for past or expected failures.
We build dashboards / sensors and consume metrics based on previous experiences. The human brain is trained to expect that what happened in the past will happen again in the future, and monitoring for these things is needed.
In reality, though, those past failures should be resolved, or designed out of the new system, so that they can no longer take your system down. That, in turn, makes them even less important to monitor, since the system should be more resilient against that failure in the future.
So with that very confusing idea out of the way: what we should be doing as operations people is monitoring the user experience of our product or application, and then reverse engineering the problem if something catastrophic does happen. Once that reverse engineering is done, fix the issue so that it can’t happen again, or at least won’t take your system down again.
100% uptime is just a stupid idea and irrelevant now.
There was one talk where the presenter said this, and I agree wholeheartedly. A few people I talked to about this at the conference said similar things: you should never set impossible goals. Whether it’s unrealistic sprints in development or 100% uptime for operations people, we should set goals that are achievable, since our brains are happier when we actually achieve them.
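The arithmetic behind this is worth making concrete: an availability target is really a downtime budget, and 100% buys you a budget of exactly zero. A quick back-of-the-envelope sketch (the 30-day window is my assumption):

```python
def allowed_downtime_minutes(slo, days=30):
    """Minutes of downtime a given availability target permits
    over a window of the given length."""
    return days * 24 * 60 * (1 - slo)

# "Three nines" leaves you roughly 43 minutes a month to play with;
# 100% leaves exactly zero, which is why it's an impossible goal.
budget = allowed_downtime_minutes(0.999)
```

Any achievable target, even 99.99%, gives the team something to actually aim for; 100% just guarantees a missed goal.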
Build monitoring into your development pipeline
As this was a monitoring conference and most of the attendees were operations people, there was a bit of piling on developers for the applications they were running. A few ideas and comments that I heard quite a few times were:
- Make it easy for your developers to use your monitoring system and put their things into it
- If possible, build it in as a requirement of the application instead of an afterthought
- If you build or run a monitoring system, adoption and usefulness will go way up if the person who wrote the application or built the infrastructure creates the sensors, since they know that thing much better than you do (unless you built it)
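Making it easy for developers to put their things into the monitoring system can be as simple as giving them a one-line way to emit metrics from their own code. As a sketch, here is a minimal StatsD-style counter client using only the standard library; the host, port and metric name are illustrative, and in practice you would point it at whatever aggregator your monitoring system uses:

```python
import socket

class MetricsClient:
    """Fire-and-forget StatsD-style counters over UDP, so a developer
    can instrument code in one line without touching the monitoring
    system itself."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, metric, value=1):
        # StatsD counter wire format: "<name>:<value>|c"
        self.sock.sendto(f"{metric}:{value}|c".encode("ascii"), self.addr)

# In application code, a single line per event:
#   metrics = MetricsClient()
#   metrics.incr("myapp.signups")
```

Because UDP is fire-and-forget, the instrumentation can never slow down or break the application, which removes one of the usual developer objections to adding it.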
DevOps is a lie and “Observability” is just a word
This really made me laugh as it hit quite close to home. Everyone says they are devopsy until the operations have to be done by developers, and then things often fall apart. There were many discussions about how DevOps and Observability are just bullshit buzzwords for describing the same stuff we’ve been doing for years. There used to be a stark differentiation between development and operations in the 90’s, but it’s been a long time since then, and we’ve all been doing one form of “DevOps” or another for some time. The same goes for Observability: what we all saw as monitoring was never just collecting and storing metrics; it’s always been the whole package. To read more about this debate, read this article.
We need to rethink how we do monitoring UX
A UX developer from Datadog was there and presented his vision of how we should be doing UX for monitoring solutions, and he made some really good points. We are all still focused on traditional thresholds that are set by hand and only work for some types of metrics. We really should rethink how this is presented and how it’s set up so that we can set realistic thresholds. One example he gave was a graph with peaks and valleys based on normal traffic: a traditional threshold doesn’t take into account the difference in volume on specific days, for example. What he created, and open sourced btw, was a way to define an “envelope” for the values and then set thresholds based on standard deviations within that envelope. It looked super easy to use and understand, and I was really impressed by his work.
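I don’t have his code to hand, but the core idea can be sketched: learn a band per time slot from historical traffic and alert on deviations from that band, rather than drawing one hand-set line across the whole graph. The hourly buckets and the 3-sigma width below are my own assumptions, not necessarily his:

```python
import statistics

def build_envelope(history, k=3.0):
    """history: list of past days, each a list of 24 hourly values.
    Returns one (low, high) band per hour of the day: mean +/- k
    standard deviations, so the threshold follows the normal daily
    peaks and valleys instead of cutting straight through them."""
    bands = []
    for hour in range(24):
        samples = [day[hour] for day in history]
        mu = statistics.mean(samples)
        sigma = statistics.pstdev(samples)
        bands.append((mu - k * sigma, mu + k * sigma))
    return bands

def is_anomalous(hour, value, bands):
    low, high = bands[hour]
    return not (low <= value <= high)
```

A value that would trip a flat threshold at the daily peak sits comfortably inside its hour’s band, while the same value at 3 a.m. falls outside it and fires.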
Just because it’s hip and new doesn’t mean it will solve your problems – you need adoption and automation
There were 2 presenters who talked about how they moved from monitoring system X to Y to Z, and that yes, it solved some problems, but it always created new ones. What you should focus on when choosing a monitoring system is its ability to be adopted and used by the people creating what you need to monitor. It needs to have robust APIs that can be used to set up and change the system, offer the ability to create dynamic monitoring / sensors and notifications, be understandable at a glance from the UI, and not spam people with dumb notifications.
As this was an open source conference, everyone I spoke to said that they wouldn’t even consider a solution that didn’t have integrations for things like ServiceNow, PagerDuty, Slack or Grafana. These services / pieces of software are simply industry standards, and any monitoring software should support them.
The human side of things
There were 3 excellent talks that focused on non-technical things.
The first was from a woman who talked about what it’s like to be a woman in development. It was incredibly insightful and gave me a much better understanding of what it’s like for women in tech. She is the team leader at Elastic for the Beats and Logstash products. One part of her talk I found particularly interesting was how she creates a team that people want to work for, including women and moms. She said the following were the most important practices she has found for having a diverse, productive and happy team:
- Allow people to work in an environment they can be productive in, whether at home, in an office by themselves, or remote.
- For working parents, being able to work from home can make the difference in whether they can work for a company at all. Without having had that option where she is now, she would have had to change jobs after her daughter was born.
- Flex schedules. Allowing people to change their schedule around to meet the needs of their family on a daily basis also allows mothers to more easily work around their kids’ schedules.
- Judge people based on what they produce and not how many hours they work. She works with all of her teams based on trust. She has dozens of people under her and does a quarterly review to make sure they are happy and to review their productivity but doesn’t care how much they work as long as they are producing. This has really motivated her team and has been adopted by the whole company since people want to return that favour by kicking ass and producing.
The second was from a psychologist turned IT operations engineer, together with another operations engineer. They also had some interesting points:
- Open floor plans severely reduce productivity because we are too easily distracted and don’t feel like we are in a protected environment.
- Monitoring makes us happy because we feel like we can use it to control our environment
- Teams that are run by dictators or micromanagers have no chance for innovation
The last one was from a woman working in New York who stressed the importance of diversity in teams. To summarize her talk: we need to get rid of our preconceptions about what an engineer looks like / speaks like / acts like and focus on people’s skills rather than anything else. She also had some great tips for promoting a diverse team:
- hand up not out – give someone a boost in the right direction without telling them the whole solution
- support diverse learning styles – allow and offer different people different ways of learning new things through pair programming, schooling, conferences or whatever
- coach, don’t rescue – rescuing is something I’ve done in the past, and it ensures that my knowledge remains only in my head, which isn’t good
- don’t treat people like a quota