Network Billy - The world's greatest crap-looking SNMP Upload and Download tool

Saturday, 06 Jul 2019

Is it Billy Network Designer, Billy the Kids or CSNMP-Tools?

I've no idea; when I was first shown this lovely piece of software by a beaming-faced colleague, I thought it was a piece of malware at best, or some Work Experience's idea of a joke. Everything about this software looks awful - just look at the GeoCities-inspired installer bitmap:

undefined

I know, I know - "DON'T CLICK NEXT! IT'S DODGY, RUN, RUN OR YOUR..." - but keep with it (and let me know whether it's called "Billy Network Designer", "Billy the Kid", "CSNMP-Tools" or "Cisco SNMP Tools", because the Start Menu Icon, Desktop Icon and Windows Titlebar all differ from each other...), because I guarantee you this is the finest tool you'll ever use:

undefined

Have you been drinking toilet cleaner again?

Drinking, yes; Toilet Duck, no.

This software is fantastic for NetOps-types because of the following killer features:

  • Cisco Tools -> Syslog and Debug Console
  • Cisco Tools -> Cisco Snmp Tool

This software is horrific (for anybody) because of the following (whatever-the-opposite-of-killer-is) "features":

  • Map Mode
  • Network Design
    • One minute in, and all of Visio's line-misalignment quirks are forgiven

Don't use it for those (although it's unusable anyway). Do use it for the SNMP Tool - it's the finest pre-DevOps thing (I know, I know - you cool kids can do this with a Python wrapper and Bash Script) to JFDI a config restore or download the running config - via SNMP, not CLI.

Wait, did you say SNMP... for config files?

Yes, yes I did:

undefined

The reason this tool is the ultimate "Get Out Of Jail Free" card (or for those of you in the Field, "Get To Pub On Time" card) is because of the ability to use this to do the following:

  • Download a running-config or startup-config, via SNMP, using the RO or RW Community
  • Upload a running-config or startup-config, via SNMP, using the RW Community
  • Remotely reboot a Network Device, via SNMP, using the RW Community*

* = As long as you had the pre-thought to have the following CLI command in the active running-config:

snmp-server system-shutdown

Why this is so great

Sometimes, Cisco Network devices like to lock-up or have a memory leak. They'll continue to pass traffic (Data Plane), but their management instrumentation (Control Plane) - such as SSH or Telnet - will lock-up or die completely, leaving you up the creek without a "conf t" paddle. I've had this plenty of times, with plenty of ages of kit (older C3524-XL's; C3750-X; C2960 - probably nearly one of every non-NX-OS animal) - and when it's in some remote Campus or Branch Office somewhere, it's not always possible to get a skilled engineer/Console cable there, nor to get someone in to reboot the kit manually.

However, with Network Billy (as we've started calling it), you can quickly send an SNMP-based reboot to it; or even upload the specific delta command you wanted to apply with SSH/Telnet anyway. Other than this Control Plane lock-out situation, it can also bail you out for:

  • That time when you change the "ip domain-name" but forget to regenerate/zeroise the RSA keypair, and SSH hangs on you
  • That time when you change it to SSH Version 2, but it's crap old kit, so it goes to the unsupported SSH Version 1.99 - which no Terminal Emulator on earth seems able to connect to
  • That time when you lock yourself out of SSH and Telnet login, because you thought it'd be a good Friday to try out the TACACS+/RADIUS/AAA change you've been wanting to do for a while
  • That time you've got a Layer 2-only Switch, which only allows you to have one Management SVI, and it's time to re-IP it
  • That time when you accidentally applied an ACL to block SSH and Telnet from the WAN-side, instead of the LAN-side
    • This one needs you to get creative, and RDP to a Desktop asset behind the LAN-side of the Switch, and run Network Billy from that to the LAN-side Default Gateway/SVI

I could go on - in short, I owe a few of my mortgage payments to Network Billy, and it's mysterious Turkish creator.

Disappearing DNS entries when your CNAME TTL differs from your PaaS Provider's

Saturday, 26 Jan 2019

The dreaded "Page can't be displayed" error

Most people in the field of IT or Networking will have seen this lovely Internet Explorer error, and immediately recognised their day was about to change course away from the schedule:

undefined

The why can vary massively; for this blog post, we'll look at one case in point - what happens when your DNS Time to Live (TTL) record, on your CNAME, doesn't match-up with your Platform as a Service (PaaS) provider's A Name. But first, a bit of background here - names changed to protect the innocent.

The Scenario with the PaaS Provider

We've got a Web Application that we've decided to farm-out to a PaaS Provider, which used to be on-premises (or "on-prem" for you cool Cloud Kids). It's very important to the Business, but for the purpose of technology employed it's nothing special - think a HTTPS Website, where the PaaS provider does DNS-based "Elastic" (boing!) Load Balancing - also known as GSLB, but the new Cloudy World has to re-invent the terms we're already used to... *grumble* *grumble*

Let's throw in some made-up pseudonyms to anonymise this a bit, and add some context:

  • My Employer (Enterprise Business, or "the Business")
    • Name - MyCompany Ltd
    • Main External URL/Domain - mycompany.com
    • Main Internal URL/Domain - prod.mycompany.uk
  • PaaS Provider
    • Name - PaaS Co. Ltd
    • Main PaaS URL/Domain - paasco.com
    • Cloud Environment Name - PaaSCloud
    • Use Load Balancers from - BigAssLoadBalancers (Vendor)

Because the Business (rightly) thinks that a new PaaS URL of https://bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com might not be as easy-to-remember as the old on-prem (yes, I'm trying to bait you with that phrase) one of https://appname.prod.mycompany.uk; and because we've got no choice about the PaaS URL, we've taken the decision to make a new sub-domain of *.paascloud.mycompany.com. While we're there, we think we'll sort out the outmoded concept of Internal (prod.mycompany.uk) vs External (mycompany.com) URLs, because this is all hosted off-prem anyway; so it's technically no longer part of our "internal" Domain.

Regardless of PasS Co, MyCompany uses Internal DNS that sits on Active Directory Domain Controllers; for the sake of ease, I'll call this "Internal DNS". MyCompany outsources it's Internet DMZ Data Centres to another MSP; we'll call them MSPCo. MSPCo's only relevance here is that they run our External DNS/Domain (from Internet-facing ns1.mspco.com DNS Servers), whereas we run our Internal DNS/Domain AD-DC DNS Servers. Or, in short:

  • MyCompany
    • Run Internal DNS Servers (i.e. pdc1.mycompany.uk) that are authoritative (but not advertised to Internet) for *.mycompany.uk
  • MSPCo
    • Run External DNS Servers (i.e. ns1.mspco.com) that are authoritative for *.mycompany.com

To give us an easy-to-remember FQDN for the AppName Web Application, we've setup the following which means it will be https://appname.paascloud.mycompany.com:

  • Sub-Domain Space (for all Apps on PaaS Co)
    • *.paascloud.my.company.com
  • Current PaaS Web App (one of the Apps on PaaS Co)
    • appname.paascloud.mycompany.com
  • Internal DNS (MyCompany, i.e. pdc1.mycompany.uk)
    • Authoritatively Resolve requests for *.prod.mycompany.uk
    • Conditional Forward requests for *.paascloud.mycompany.com to ns1.mspco.com
  • External DNS (MSPCo, i.e. ns1.mspco.com)
    • Authoritatively Resolve requests for *.paascloud.mycompany.com

The Problem with DNS Recursion

All that we've achieved above is a series of "forwarders", such that, for the worst case (Internal Client), they'll do this:

  1. Lookup appname.paascloud.mycompany.com against Internal AD-DC DNS (i.e. pdc1.mycompany.uk)
  2. Internal AD-DC DNS Condtional Forwards this to MSPCo External DNS (i.e. ns1.mspco.com)
  3. MSPCo External DNS (i.e. ns1.mspco.com) resolves this to a CNAME of bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
    1. MSPCo External DNS (i.e. ns1.mspco.com) then Recursively Resolves this against it's upstream DNS Provider (let's say dns1.bigisp.com)...
    2. ...Which queries the Root DNS Servers (i.e. a.root-servers.net), which tell it to ask the PaaS Co Authoritative DNS Servers (i.e. ns1.paasco.com) for the A Name associated with bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com...
    3. ...Which comes back from PaaS Co DNS Servers (i.e. ns1.paasco.com) as Public IP Address 203.0.113.234 (not real, check out RFC 5737 - IPv4 Address Blocks Reserved for Documentation)
  4. Internal AD-DC DNS replies back to the Internal Client, for a request of appname.paascloud.mycompany.com, with:
    1. (The CNAME) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
    2. (The A Name) 203.0.113.234

Phew, there's a lot of steps eh? But at least we're out of the woods now, the client has the IPv4 Address it needs, so what's the "Page not Displayed" thing all about?

Pesky DNS TTLs

Here's the bit where the hierarchy of recursion in DNS starts to 1-up you, and the bad day kicks in - perhaps as known all-too-well by these graffiti artists:

undefined

Firstly, a caveat - all of the below may be different for your scenario, depending on how MSPCo DNS Recursion is/isn't setup.

If we make use of the lovely nslookup tool on Windows, here's what we can deduce for our good response (i.e. when the page actually displays, rather than the dreaded IE "Page not Displayed" error). Remember that pdc1.mycompany.uk is my Internal DNS Server (for this example anyway, in reality AD has a Parent/Child Regional Domain Controller hierarchy, so each Client uses a different AD-DC):

C:\Users\NervousAdmin>nslookup
> set debug
> server pdc1.mycompany.uk
<snip - goes off and resolves pdc1.mycompany.uk to IP 10.0.1.99>
> appname.paascloud.mycompany.com.
Server: pdc1.mycompany.uk
Address: 10.0.1.99

------------
Got answer:
 HEADER:
 opcode = QUERY, id = 24, rcode = NOERROR
 header flags: response, want recursion, recursion avail.
 questions = 1, answers = 2, authority records = 0, additional = 0

 QUESTIONS:
 appname.paascloud.mycompany.com, type = A, class = IN
 ANSWERS:
 -> appname.paascloud.mycompany.com
 canonical name = bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
 ttl = 7200 (2 hours)
 -> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com 
 internet address = 203.0.113.234
 ttl = 60 (1 min)
<snip>
------------
Name: bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
Address: 203.0.113.234
Aliases: appname.paascloud.mycompany.com

Given the response above is good (when everything is working), what does the above tell you? If we focus on the TTL sections, you'll see Windows has cached two responses here:

  1. appname.paascloud.mycompany.com -[CNAME]-> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com, cached for 7200 seconds (or 2 hours)
  2. bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com -[A Name]-> 203.0.113.234, cached for 60 seconds (1 min)

So what happens in 60 seconds, when that A Name expires then? Let's find out - the ">" shows you are within nslookup, so just hit the Up key, and Enter to re-lookup "appname.paascloud.mycompany.com." (as per prior posts, the appended dot means "just this exact FQDN, and no additional DNS Suffixes"), eventually you'll notice the bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com section goes to a TTL of 0:

> appname.paascloud.mycompany.com.
Server:  pdc1.mycompany.uk
Address:  10.0.1.99
<snip - only interested in the CNAME ttl section>
ANSWERS:
<snip>
-> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com 
 internet address = 203.0.113.234
 ttl = 0

But you'll notice your browser access to https://appname.paascloud.mycompany.com works fine during these tests; until you do the nslookup again, after the "ttl = 0" response. Now, there be dragons.

Uh-oh, where's my response gone?

When you refresh again, your heart will drop, your bum will tighten, your browser access to https://appname.paascloud.mycompany.com will stop working, and you'll see this:

C:\Users\NervousAdmin>nslookup
> set debug
> server pdc1.mycompany.uk
<snip - goes off and resolves pdc1.mycompany.uk to IP 10.0.1.99>
> appname.paascloud.mycompany.com.
Server:  pdc1.mycompany.uk
Address:  10.0.1.99

------------
Got answer:
    HEADER:
        opcode = QUERY, id = 28, rcode = NOERROR
        header flags:  response, want recursion, recursion avail.
        questions = 1,  answers = 1,  authority records = 0,  additional = 0

    QUESTIONS:
        appname.paascloud.mycompany.com, type = A, class = IN
    ANSWERS:
    ->  appname.paascloud.mycompany.com
        canonical name = bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
        ttl = 6926 (1 hour 55 mins 26 secs)

<snip>
------------
Name:    appname.paascloud.mycompany.com

Which will give you your dreaded "Page not Displayed friend", for exactly another 1 hour, 55 minutes and 26 seconds.

And how do I know that? Because that's what the TTL says that CNAME entry will stay in your cache for - regardless of the fact your Windows Client hasn't had a recursive response of the actual IP Address that it ultimately resolves to (203.0.113.234).

So what's the fix? Firstly, lets touch on DNS TTL. This isn't much different to IPv4 TTL; it just means that, once the TTL hits 0, the entry will be purged from your local DNS Cache. What happens next is the crucial part, dictated by the "DNS Response Hierarchy" your response had; if it's just a straight single-level hierarchy (i.e. domain.com -> 203.0.113.1), then your Client will go off and re-request the DNS Request to lookup domain.com to an IP Address.

But our case is different, and not in a good way - our "DNS Response Hierarchy" looks like this:

  1. (Parent) Fetch appname.paascloud.mycompany.com
    1. (Child) If you got here, now fetch bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com

But our TTL's look like this:

  1. (Parent) appname.paascloud.mycompany.com = TTL <bigger than "Child">
    1. (Child) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com = TTL <smaller than "Parent">

That's not what we want at all; given these are two differing DNS Administrative Domains (owned and operated by two differing Companies - MSPCo for appname.paascloud.mycompany.com and PaaS Co for bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com), we (MyCompany) don't have any direct control over these. Regardless though, we need them to flip-it-around so that this happens:

  1. (Parent) appname.paascloud.mycompany.com = TTL <smaller (or same) than "Child">
    1. (Child) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com = TTL <bigger than "Parent">

This way, when the "Parent" (initial, or root, or "actual FQDN I wanted the IP for") TTL expires, it will remove the "Child" (CNAME) entry with it; which means the DNS Lookup process will re-occur, and we'll happily get an IPv4 Address back. Technically simple, but you try and explain that to MSPCo and PaaS Co, and you'll find your "shouty voice TTL" quickly gets towards that precious 0...

Remotely changing the Management SVI on a Cisco 3524XL

Friday, 25 Jan 2019

A Cisco 35-what-what now?

You probably haven't heard of a Cisco 3524XL. You're possibly sat reading this thinking: "I've heard of the Nexus 3K, sure, but WTF is a 3520-Seires, am I behind already?". The answer is no, you aren't (or yes, you are if you're unfortunate enough to know what a C3524XL is) - but don't take my word for it, let's ask what Danny Dyer thinks:

undefined

Why are you blogging about a Cisco Switch that went EoL over a decade ago?

Indeed, the Cisco Catalyst 3524XL went End of Life in 2002 - far before I even started working in the field of Networking. So why am I talking about it here? Well, a few reasons:

  1. @DarrenFullwel challenged me to on Twitter
  2. It's got lessons to teach us all
  3. History needs to remind us that banging on the suffix "XL" should only be confined to fast food and t-shirts

Let's focus on what it can teach us - first, a little primer on my chief bugbear with it as a "capable Layer 3 Campus Access Switch".

The C3524XL only supports one SVI

That's not too bad you might think; you probably only want to give it a Management IP Address to the SVI, and let something more capable handle inter-VLAN Routing. But what happens when you want to do something like this:

  1. Remotely re-IP Address the Management IP (and the boss won't let you hire a van and take the day to drive to the arse-end of nowhere)
  2. Remotely change the configuration your colleague left with it using VLAN1 as the SVI, but everywhere else uses VLAN55 for Switch Management (and the boss still won't let you hire that van)

Any ideas on how you're going to sort that out, remotely? Let me introduce you to the age-old Network Engineering practice of...

Squeaky bum time

undefined

There's nothing for it, soldier; we've got two basic choices to do this remotely, and we're gonna need a stock of toilet roll for both:

  1. Use a SNMP-based config upload tool like Network Billy (coincidentally the finest thing to have come out of a GeoCities website)
  2. Use a TFTP-based config upload tool (like TFTPd32)
  3. Keep hassling the boss for that van

I went for option two, TFTP-based; but the basic concepts are the same. Firstly, we're going to double-check what we want to achieve; for my scenario, that's two things:

  1. Disable VLAN1
  2. Migrate the Management IP to VLAN55 (172.31.0.0/24)
    1. I'll also have to change this upstream, so that my L3 Default Gateway Switch/Router moves 172.31.0.0/24 from VLAN1 to VLAN55, or have both co-exist for a while and VRF Lite one VLAN off from the other; but that's for another blog post

To do this interactively, I'd want to do something like the following:

conf t
int vlan1
 no ip address
 no desc
 shut
vlan 55
 name Mgmt_VLAN
int vlan55
 desc Management VLAN
 ip address 172.31.0.99 255.255.255.0
 no shutdown
ip default-gateway 172.31.0.1
end
wr mem

But we don't have that luxury, so we'll go for a three-step approach.

Step 1 - The interactive bit

We need to setup the VLAN (just at Layer 2) ready to go; as we're talking about an archaic C3524XL, depending on the age of IOS on the Switch, that's either going to be the "new Cisco way" (as above), or if you're as unlucky as Dyer thinks, the old VLAN Database method, like this:

C3524XL#vlan database
vlan 55
exit

Regardless of which, we'll then check we've got the VLAN ready to go, and if necessary, add it to any 802.1q Trunk interfaces up to the Core (L3) Switch:

C3524XL#sh vlan id 55
C3524XL#sh int trunk | inc Span|Port|55

Now onward to the offline part.

Step 2 - The offline bit

Firstly, we need to grab the config file off the C3524XL. If you've got TFTPd32 running on your PC (which needs to be accessible from the existing C3524XL VLAN1 SVI IP Address, say your PC is 10.0.0.99), this is just a matter of turning TFTPd32 on, configuring it to a directory and ensuring Winblows Firewall isn't blocking inbound TFTP (UDP/69). Then login to your C3524XL, and do something like this to copy the config from the Switch to your PC:

C3524XL#copy run tftp://10.0.0.99/c3524xl-confg
yes

Now you have the file locally, we'll be editing it in a text editor to make the changes above, and turn it into the startup-config (for the sake of space, I'm only showing the changed lines; the rest of the config needs to be there, you are only Find-Replacing these sections):

<snip - rest of config removed, but would be there>
hostname C3524XL
<snip - rest of config removed, but would be there>
int vlan1
 no ip address
 no desc
 shut
int vlan55
 desc Management VLAN
 ip address 172.31.0.99 255.255.255.0
 no shutdown
<snip - rest of config removed, but would be there>
ip default-gateway 172.31.0.1
<snip - rest of config removed, but would be there>

A few handy hints here:

  • Make sure all your interconnect, Trunks and Management SVI VLAN55 are set to "no shutdown"
  • Triple-check that in your scenario it is actually VLAN 55 for Management; the IP Address is correct and doesn't conflict & VLAN55 exists and would be allowed on the Trunk

Nothing left now but to execute our actions and make rocket go now!

Step 3 - The bit you make a calming brew beforehand for

Now it's crunch time. You've obviously got an RFC Change Request that's approved to do this (because you wouldn't "Lab on Live", would you?), so what's to fear, eh?

Firstly, we upload the amended config file, straight into startup-config:

C3524XL#copy tftp://10.0.0.99/c3524xl-startup.txt startup-config

Then we get paranoid and double-check it copied everything correctly, that we're definitely Trunking that VLAN55 and we've set the Management VLAN 55 to "no shut":

C3524XL#sh start
C3524XL#sh vlan id 55
C3524XL#sh int trunk | inc Span|Port|55

And finally we sup-up that brew, clench the derriere, and invoke the outage-causing Management IP switchover:

C3524XL#reload
yes

Then we wait, and nervously set our local PC Command Prompt "ping-t" going, waiting for it to pop back up with the new Management IP address:

C:\Users\NervousAdmin>ping -t 172.31.0.99

Pinging 172.31.0.99 with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
<2-3 nervous minutes later>
Reply for 172.31.0.99: bytes=32 time=13ms TTL=64
Reply for 172.31.0.99: bytes=32 time=13ms TTL=64
[CTRL+C]

Wrapping it up

And there we go; remotely changing the Management VLAN and IP Address of a Switch that's older than time - and hopefully a useful tip if you have a similar single-SVI-only piece of sh... kit. Enjoy!

When BGP AS-Override goes the wrong way

Sunday, 13 Jan 2019

BGP AS-Override

Much like my post on when BGP SoO goes the wrong way, I seem to have a problem with directionality of commands on Cisco IOS - this time, with BGP AS-Override. I came across this in an Enterprise Network (the same kind where we say "MPLS" but actually mean "IP VPN we buy from someone else"), where the ISP we used had an offering they called "Shared Access" - which basically means they'll let you hook an Access Circuit into someone else's IP VPN/VRF with them, as long as you, the ISP and the "VRF Owning Company" co-sign an agreement saying it's allowed.

Why might you want to do this? Think along the lines of Extranets, and furthering the idea that "Everything is just a Line Card" across Company boundaries; particularly useful if you work in the Large Enterprise and Public Sector space, as here there are often strange agreements where multiple Managed Service Providers (MSPs), Systems Integrators (SIs) and sometimes even Service Providers (SPs) (reluctantly) come together to offer a common "Service" back to either the General Public, or perhaps some large Industry Sector. Regardless of the why, the problem is normally the same old BGP-over-VRF limitation - if you use the same ISP for multiple IP VPNs/VRFs, and have end-to-end BGP reachability, BGP doesn't know to turn off it's split-horizon-based-on-ASN functionality; because it just sees the same ASN twice in the AS_PATH, rather than "knowing" that the AS_PATH consists of two differing VRFs/Routing Domains.

The Scenario Topology

undefined

This is the Scenario Network Topology, showing:

  • 2x My Network MPLS CE Network Customer Edge (CE) Router
  • 4x MPLS SP Network Provider Edge (PE) Routers
    • 2x Connected to My Company Network IP VPN/VRF VPNN123456
    • 2x Connected to Other Company Network IP VPN/VRF VPNN654321
  • 2x eBGP Peering from My Company Network < -> SP MPLS PE Router, connected to My Company IP VPN/VRF VPNN123456
  • 2x eBGP Peering from Other Company Network <-> SP MPLS PE Router, connected to Other Company IP VPN/VRF VPNN654321
    • 1x "Foreign Network" CE Router @ My Company Data Centre
  • AS-Override applied on My Network MPLS CE Network (CE) Router (towards My Company IP VPN/VRF VPNN123456)
    • Note that I am "piggy-in-the-middle"

Some notes on SP Terminology

As some of this is specific to using a Third Party SP's MPLS Network, through a "wires-only" IP VPN offering - here's a quick primer on some terminology I'm using, as this will differ between varying SP's:

  • "wires-only" - Means the SP drops a NTE/NTU in My Company's Premises, to which I attach my self-managed CE Router
    • The SP does not manage any of CE Router; I eBGP Peer direct from a Private ASN to the SP's Public ASN (or whatever they use)
    • I'm told this model is more popular in the USA than Europe (but I'm in the UK, so there are exceptions to the rule...)
  • VPNNxxxxxx - The SP-allocated IP VPN/VRF Identifier, so that they can differentiate between their various Customers (they could name their VRF instances by Company Name, but what happens when the Company changes name, or two different Companies have the same/similar names...)
  • ASN Numbers - Those on the left-hand side are My Network ones; those on the right-hand side are "Foreign" (Other Company Network) ones
    • Just like between IPsec Encryption Domains, it's a good idea to make sure these don't conflict (tricky when everyone is using the same Private BGP ASN Range)
    • It is the same Core ASN/PE-CE Peering ASN that the SP uses for all Customers
  • CE Devices - I am the Customer (or one of two), and not the SP here; I have no visibility or access to any of the PE's in this topology
    • This is a very different slant to most write-ups and blog posts I've read on the matter; everyone seems to work for an SP bar me!
  • AS-Override - This is applied at My Company end only; the "Foreign" Company are not performing AS-Override
    • So the AS_PATH they "advertise" to me contains the raw SP ASN for their own CE-PE Peering,Their CE1 <->  PE2 and Their CE15 <-> PE66

What I thought would happen

Caveat - apparently, Cisco IOS doesn't let you use AS-Override in the Global Routing Table (GRT, y'know, the one that's not in an "address-family" command); but it sometimes does (worked on my ASR1K's), and that's not the point of this post.

Focussing on My Company Data Centre - and ignoring the "Southbound" eBGP Peering from this DC into MPLS IP VPN/VRF VPNN123456 - here's an example of the Prefix I'm looking at, received from "Foreign" Company:

CE1#172.31.0.0/24 via <DC-Router1>, AS_PATH: 65007 1234 64999

Now, if we look at the "Southbound" eBGP Peering towards My Company IP VPN/VRF VPNN123456, I want to re-advertise "Foreign" Company Prefix 172.31.0.0/24 onward, via VPNN123456, into My Company Other Campus DEF (bottom-right). Given the "as-override" command is applied towards the SP's PE Router, I expected the "find-and-replace" operation to work in a similar (outbound) manner. That is, for this configuration on my CE1 Router @ My Company Network, Data Centre ABC:

CE1#
router bgp 65432
 neighbor 192.168.0.1 remote-as 1234
 neighbor 192.168.0.1 as-override

I thought my CE1 Router would therefore rewrite it's own AS65432 (Local ASN, CE1 Router) with the SP's AS1234 (Foreign ASN, CE1 Router perspective) - so an AS_PATH that actually looks like this, to the downstream PE1 (and any other Routers) on VPNN123456:

PE1(VRF "VPNN123456") or CE99#172.31.0.0/24 via 192.168.0.2, AS_PATH: 65432 65439 65007 1234 64999

 ...but that's not how AS-Override works here.

What actually happens

It transpires the "find-and-replace" behaviour isn't working with the "find" parameter I think it is. If I use some colouring here, this will be easier to see. If we show the entire AS_PATH (including the Routers at either end, which you normally wouldn't see in BGP outputs), here's what you've got for Prefix 172.31.0.0/24 going all the way to CE1 @ My Company Data Centre ABC:

  • 64999 1234 65007 65439 65432 1234 65430

I appreciate this runs inverse/reverse to the AS_PATH that CE1 actually sees; but bear with my incorrect directional thinking here. So the part I'm focusing in on is between CE1 <-> PE1, or this part:

  • ...65432 1234...

At this point, in my head, I'm thinking "The neighbour command is applied outbound to the 192.168.0.1 SP PE1 peering, so it must use this relationship in the find-replace activity", so I'm thinking, after the AS-Override rewrite, it looks like this:

  • 64999 1234 65007 65439 65432 65432 65430

Here's the kicker

The reality is that AS-Override doesn't care about eBGP Peering relationships; it acts as a dumb "find-replace" algorithm, but it uses the eBGP Peering configuration to get it's "find" parameter, by looking at the ASN value after the "remote-as" command, so here for CE1:

  • router bgp 65432
     
    neighbor 192.168.0.1 remote-as 1234

What it then "dumbly" does is looks at the entire AS_PATH it already has, and simply replaces the <REMOTE_AS> value with it's <LOCAL_AS>, before "advertising" this out, so for CE1 it would do this instead:

  • 64999 65432 65007 65439 65432 65432 65430

Which completely broke my thinking, as I hadn't appreciated that a downstream Router could overwrite an AS_PATH entry that happened much earlier-on in the formation of the AS_PATH (i.e. for a Peering Association it wasn't involved in, so how could it dare overwrite anything to do with that?).

So what next

For the example given, we actually ended up moving all this entirely, such that we had a PE-like Router where we could control ingress/egress into both IP VPNs (and AS-Override in both directions, between both IP VPNs) - but this isn't always possible. Technologically, it's easy to look dismissively at the Scenario Topology; but if you step back a bit, you appreciate our hand was forced. As I described earlier, this is a politically complex setup, with various MSPs and SIs - and as you can see, although CE15 sits in "our" DC (actually an MSP, but anyway...), it's actually a CE Router of our "Foreign" (think Extranet) Company's IP VPN (VPNN654321); which they just so happen to have with the same SP that we have Our Company IP VPN (VPNN123456) with.

Sure, this isn't a great place to be - but (in that time-honoured phrase), "It is, what it is"; looking longingly at CCNP and CCIE Greenfield Exam Topologies isn't making this self-rectify. We were fortunate because we had the capability to entirely redesign this (something for another blog post), but if we hadn't, there's a whole manner of constraints here causing pain, such as:

  • SP won't let us reconfigure their PEs on either IP VPN/VRF (so no quick-win "Bang AS-Override on PE66 and PE1" for you)
  • Commercials mean we can't collapse-out the CE15 <-> PE66 arrangement
  • CE1 / Data Centre ABC doesn't just exist for this flow (so no quick-win "Bang the VPNN123456 eBGP Peering into a VRF Lite instance, instead of the GRT"

What's the point then?

Ignoring the goal of getting this working, this was a useful real-world exercise, as it taught me:

  1. BGP AS-Override is dumb, and will quite happily assume the <REMOTE_AS> to <LOCAL_AS> Peering is the only one that contains the <REMOTE_AS>, which couldn't possibly already be in the AS_PATH
  2. BGP is not VRF-aware; it's rules of split-horizon are there to annoy me and rob me of sleep
  3. Stop reading "neighbor" commands and assuming they imply the directionality of the thing they are doing
  4. Googling for issues like this throws up limited results, because everyone else seems to be able to access the SP PE Routers
  5. I need to flip-round the way I think about AS_PATH as "Destination-to-Source" rather than "Source-to-Destination"
Home ← Older posts