
How's Your Interac Doing? (July outage & telecomm stuff in general)

My money is on worker incompetence and/or a lack of system redundancy, because that's the most plausible explanation for the outage. I doubt it was a DDoS attack, but this sure as hell has highlighted a few chinks in the armour to our enemies.
Unlikely. The outage affected both the wireless and wireline services, so it was most likely an issue on the main transport ring. The fact that it happened during the morning maintenance window, when network activities are scheduled, tells me it was likely an equipment-vendor software or firmware upgrade on the transport servers/routers that failed and blocked communications with other key equipment.
 
Yes.

Though I would point out that worker incompetence and a lack of redundancy, while not the cause, were most likely contributing factors.
 
Mistakes aren't necessarily incompetence. Very few people operate at 100% of specification 100% of the time.
 
After 15 years in the industry, I assure you this is false. At least the perception of it is. You're either up 100 percent of the time or you're worthless.

We burn people out on the regular because there cannot be mistakes. We accredit, micromanage, over-certify, and operate within a level of paranoia not many folks understand.

The RCCS is regularly treated the same way Rogers was during the outage: "I don't care, get it back online or else. I have a CUB slide to complete and it's your fault if it's not done in time..."

In communications, mistakes are incompetence, both individually and systemically. You either set up a system with no checks, balances, or redundancy; or you allowed someone untrained or unsupervised to perform a critical upgrade without testing it first.

Both are big red flags in my book.
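To make the "checks, balances, and testing" point concrete, a crude change-control gate can be as simple as the sketch below. Every field name and check in it is invented for illustration; it is not anyone's actual process.

```python
# A sketch of a change-control gate: a critical change only proceeds if it
# has been lab-tested, peer-reviewed, has a rollback plan, and is supervised.
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    description: str
    lab_tested: bool = False      # verified on a test bed first
    peer_reviewed: bool = False   # a second set of eyes signed off
    rollback_plan: str = ""       # documented way to back the change out
    supervised: bool = False      # someone senior is on shift for the window

def blockers(change):
    """Return the reasons this change must NOT proceed (empty list = go)."""
    reasons = []
    if not change.lab_tested:
        reasons.append("not tested on the lab system")
    if not change.peer_reviewed:
        reasons.append("no peer review")
    if not change.rollback_plan:
        reasons.append("no documented rollback plan")
    if not change.supervised:
        reasons.append("nobody supervising the maintenance window")
    return reasons

if __name__ == "__main__":
    change = ChangeRequest("transport router firmware upgrade", lab_tested=True)
    problems = blockers(change)
    print("Blocked: " + "; ".join(problems) if problems else "Cleared to proceed.")
```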
 
Not defending Rogers, but a national telecom system is a mish-mash of equipment from multiple vendors and of various generations. Technologies change over time and, because of extremely widespread installation bases, don't all get upgraded at the same time. The network has also grown over time as the major telecoms absorb smaller companies and have to interface with their legacy equipment until it can be upgraded to the latest standards.

Frankly, it would be impossible for a third-party vendor to test with 100% fail-safe certainty that a particular upgrade to its own equipment won't cause some unforeseen issue with some other part of the network, simply because it is impossible to build a testbed that exactly mirrors the incredibly complex real system (and yes, there are test centers, and yes, system upgrades are tested in advance).

The fact is that most network activities happen pretty seamlessly. There are literally hundreds of them nightly, and most go completely unnoticed by the public. This particular one was obviously a massive failure with serious consequences.

That doesn't take Rogers off the hook. Obviously more needs to be done to prevent this from happening again. Last year's big outage was caused by an Ericsson software update gone bad, so that should have been a pretty big signal that some type of protection needed to be put in place to isolate and recover from that kind of problem (assuming that was indeed what caused Friday's outage).
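One concrete shape for that kind of protection is a staged (canary) rollout: upgrade one node, health-check it, and automatically back everything out on failure. A minimal sketch follows; the node names and the upgrade/health-check/rollback calls are placeholders, not any vendor's real tooling.

```python
# Minimal sketch of a staged (canary) rollout with automatic rollback.
import random

def upgrade(node):
    print(f"upgrading {node}")

def rollback(node):
    print(f"rolling back {node}")

def healthy(node):
    # Stand-in for real checks: routing converged, sessions up, traffic flowing.
    return random.random() > 0.2

def staged_rollout(nodes):
    done = []
    for node in nodes:                  # one node at a time, canary first
        upgrade(node)
        done.append(node)
        if not healthy(node):
            print(f"{node} failed its health check; backing everything out")
            for n in reversed(done):    # undo in reverse order
                rollback(n)
            return False
    print("rollout completed on all nodes")
    return True

if __name__ == "__main__":
    staged_rollout(["core-router-01", "core-router-02", "core-router-03"])
```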
 
After 15 years in the industry, I assure you this is false. At least the perception of it is. You're either up 100 percent of the time or you're worthless.

After 30 years in the industry, I have my own experiences on which to draw.
 
Not defending Rogers, but a national telecom system is a mish-mash of equipment from multiple vendors and of various generations...
All very true, and in my professional opinion irrelevant. The "national telecom system" has been that way since the 1880s, when the fledgling CP Telegraph interconnected with Western Union in the USA to provide a "national" and even continental system. Good, or even just OK, engineering manages the mix of old and new equipment from multiple vendors; that's why we have standards.

Nothing is ever "fail safe," but half-decent system engineering and management means that when you are faced with cascading failures (which is what I'm guessing happened, very quickly, to Rogers), the "cascade" will actually be what we used to call graceful degradation, and it will not happen quite so quickly.

Software changes are notoriously risky - too many people, especially managers, expect them to be cheap and easy; the same people also believe in the tooth fairy.
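In software terms, graceful degradation usually means isolating the failing piece so callers get a quick, degraded answer instead of stacking retries and timeouts against it; the circuit-breaker pattern is the textbook example. A toy sketch, with an invented class and arbitrary thresholds:

```python
# Toy circuit breaker: after repeated failures, stop calling the broken
# dependency and serve a degraded answer, instead of letting timeouts and
# retries cascade through every layer.
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None             # None means closed (healthy)

    def call(self, func, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()         # degrade instead of piling on
            self.opened_at = None         # cool-down over: try the real call
            self.failures = 0
        try:
            result = func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # open the circuit
            return fallback()

if __name__ == "__main__":
    breaker = CircuitBreaker()
    def lookup():                         # stand-in for a call into the failed core
        raise TimeoutError("no answer from the core")
    for _ in range(5):
        print(breaker.call(lookup, fallback=lambda: "last known good answer"))
```

The failure is then contained and gradual rather than instant and total, which is roughly what "graceful degradation" is getting at.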
 
Relaxed patio/booze rules help/encourage people to congregate, thus reducing their usage of telecommunications and associated media. The longer they relax with booze, the less capable they become of competently using any device which employs telecommunications, also reducing network strain.
Booze can and often does lower inhibitions, does it not? How many attack tweets or whatever are made under the influence?
 
Going to have to start explicitly tagging things when I think I'm being funny...
 
expect them to be cheap and easy

That's rare (non-existent, actually) in my experience. Managers fret, and have to be convinced that risks have been assessed and contingencies prepared. For clarity, my "lane" was SCADA systems (systems which monitor production assets, not the production assets themselves), where the frequency of changes is overall a lot lower. In the absence of contrary evidence, I assume that there are not two different cultures of maintenance, though.

100% test coverage is an ideal, not something that actually happens for anything non-trivial. Rollback plans are on occasion executed. The existence of a fault is uninteresting; if the report the CRTC demands is publicly released - or at least a meaningful summary - the detection of and response to the fault will be interesting.
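Detection and response can be made concrete with a watchdog loop: run an end-to-end probe, confirm a fault after a few consecutive misses, record the time to detect, and kick off the prepared contingency. A rough sketch; the probe and the contingency below are placeholders, not anyone's actual tooling (here the probe simply starts failing after three passes so the loop terminates).

```python
# Rough sketch of a watchdog: probe end-to-end service, confirm a fault after
# a few consecutive failures, record time-to-detect, and run the contingency.
import itertools
import time

_probe_results = itertools.chain([True, True, True], itertools.repeat(False))

def probe():
    """Placeholder end-to-end check (e.g. a test call or test transaction)."""
    return next(_probe_results)

def run_contingency():
    print("executing the prepared rollback/contingency plan")

def watchdog(interval=0.5, failures_to_act=3):
    consecutive = 0
    first_failure = None
    while True:
        if probe():
            consecutive, first_failure = 0, None
        else:
            consecutive += 1
            if first_failure is None:
                first_failure = time.monotonic()
            if consecutive >= failures_to_act:
                print(f"fault confirmed {time.monotonic() - first_failure:.1f}s "
                      "after the first missed probe")
                run_contingency()
                return
        time.sleep(interval)

if __name__ == "__main__":
    watchdog()
```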
 