Well, we have talked about mentoring in the past, but we run into issues with it. It's hard for someone just learning to spend tons of time being mentored and we can't (and don't want to!) force them to either. So we don't know who is going to be around for a while or who has time.
Not what folks are asking, but you can pick up a lot just by hanging out and watching the on-call person mutter to themselves in IRC (not sure that is the right phrase, umm...). I don't know if this will translate for everyone, but there is a concept of "rubber duck" problem solving, where if you have a particularly difficult issue and you explain it to someone, it helps you solve the problem more easily. I don't know if this is how everyone works, and it really doesn't help if someone is jumping up and down and quacking while you're trying to think. I guess my point is, just hanging out unobtrusively when you can is fairly helpful all around.
> do some kind of intensive mentoring
I wonder what it would be like if there were an apprentice slot for each oncall shift?
For example, an apprentice is scheduled to be paged along with the oncall sysadmin; then at the least they can shadow the sysadmin, help with communication on IRC, do initial troubleshooting & monitoring with the apprentice auth/access level, etc.
If I recall, alerts are pretty easily accessible. You can poke around on Nagios if there are issues. Obviously if everything is down/red, it's not a good time to ask for help with your ssh access.
Well, it depends on the stuff, I guess. It means that the oncall person not only has to watch for and respond to pings on IRC and triage tickets, but also explain/teach an apprentice. If there is time and a willing apprentice, that's great! If things are really busy, it could be too much work to do at once.
I'm open to ideas here... we should all try and figure better ways to get stuff done and learn and have a good time doing it. :)
kevin
A couple ideas:
- stream your terminal session when working an outage (could be hard to find a 100% FOSS version that is secure)
- plan some outages in stage for apprentices to work, at some time when tickets are low and nothing urgent is planned (I don't know that I've ever heard of such a time, but in theory it could exist)
Of course, it's late here, so this may all turn out to be nonsense, but good discussion anyway.
-Zach #aikidouke
On 05/15/2018 08:12 PM, Zach Villers wrote:
> Not what folks are asking, but you can pick up a lot just by hanging out and watching the on-call person mutter to themselves in IRC (not sure that is the right phrase, umm...). I don't know if this will translate for everyone, but there is a concept of "rubber duck" problem solving, where if you have a particularly difficult issue and you explain it to someone, it helps you solve the problem more easily. I don't know if this is how everyone works, and it really doesn't help if someone is jumping up and down and quacking while you're trying to think. I guess my point is, just hanging out unobtrusively when you can is fairly helpful all around.
Yeah! https://en.wikipedia.org/wiki/Rubber_duck_debugging
But yes, we already do talk in IRC about what we are doing, what's going wrong, and what the fix might be. Everyone is welcome to watch/ask questions.
> If I recall, alerts are pretty easily accessible. You can poke around on Nagios if there are issues. Obviously if everything is down/red, it's not a good time to ask for help with your ssh access.
Indeed. Yep. Nagios is pretty available to all.
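For reference, Nagios Core's web UI supports view-only accounts via cgi.cfg; a sketch, assuming an `apprentice` login already exists in the htpasswd file (the username here is just a placeholder, not our actual config):

```
# /etc/nagios/cgi.cfg -- grant apprentices view-only access
authorized_for_read_only=apprentice
# let them see all hosts/services, still without any command authority
authorized_for_all_hosts=apprentice
authorized_for_all_services=apprentice
```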
I can try and make sure I note exactly what I am doing to clear an alert... sometimes I am bad about saying "fixing that" or "poking that" without saying what exactly is going on.
> A couple ideas:
> - stream your terminal session when working an outage (could be hard to find a 100% foss version that is secure)
Yeah. ;( There are things like tmux that might make this possible. However, we do want to make sure someone else cannot take control of our sessions. :)
I did look for some screencast-type software for the command line a while back and was disappointed that all of them needed something non-free, or a website, to view/decode. ;( I guess there is always 'typescript' (the file that script(1) writes).
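On the 'typescript' note, a minimal sketch of an all-FOSS approach using util-linux's script(1)/scriptreplay(1); the file paths are just examples, and in practice the oncall person would run plain `script --timing=... <logfile>` and work interactively until `exit`:

```shell
# Record a session plus timing data (done non-interactively via -c here,
# just so the example is self-contained):
script --timing=/tmp/outage.tm -c 'echo restarting the broken service' /tmp/outage.log

# Apprentices on the same host can follow along read-only -- unlike a
# shared tmux session, nobody can take control of the session this way:
tail -n +1 /tmp/outage.log

# ...or replay the whole session later at the original typing speed:
scriptreplay --timing=/tmp/outage.tm /tmp/outage.log
```

The timing file is what lets scriptreplay reproduce the pauses, which is handy for seeing where the oncall person actually stopped to think.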
> - plan some outages in stage for apprentices to work at some time when tickets are low and nothing urgent is planned ( I don't know that I've ever heard of such a time, but in theory it could exist )
Ha. Yeah, that's a nice idea.
One thing I would very much like to do is move all the *stg* services to a noc01.stg instance that doesn't page, just IRC and non-urgent email. Once that's separate we could indeed try and run some alerts. :)
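A non-paging staging noc could boil down to a Nagios contact whose notification commands only send email; a sketch, with hypothetical names rather than the real Fedora infra config (notify-*-by-email are the sample commands shipped in Nagios' default commands.cfg):

```
define contact {
    contact_name                   sysadmin-noc-stg
    alias                          Staging NOC (no paging)
    host_notification_period       24x7
    service_notification_period    24x7
    host_notification_options      d,r
    service_notification_options   w,c,r
    # email-only: no SMS/pager command attached to this contact
    host_notification_commands     notify-host-by-email
    service_notification_commands  notify-service-by-email
    email                          noc-stg@example.org
}
```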
> Of course, it's late here, so this may all turn out to be nonsense, but good discussion anyway.
No no, it was great.
I think discussing this is good... and hopefully we can get folks more involved (however that happens).
kevin
On 05/16/2018 07:23 PM, Kevin Fenzi wrote:
>> If I recall, alerts are pretty easily accessible. You can poke around on Nagios if there are issues. Obviously if everything is down/red, it's not a good time to ask for help with your ssh access.
> Indeed. Yep. Nagios is pretty available to all.
> I can try and make sure I note exactly what I am doing to clear an alert... sometimes I am bad about saying "fixing that" or "poking that" without saying what exactly is going on.
That's a good idea, and I noticed you already started doing this, thanks Kevin.
To expand a bit on this, I logged/explained in more detail [1] what I did when working on one random issue. Since it was in staging, I could take some time to copy relevant commands and write down my thought process. Apprentices, if comments like this are useful, please let me know.
[1] https://pagure.io/fedora-infrastructure/issue/6918#comment-512552
infrastructure@lists.fedoraproject.org