Installing ESXi onto a Cisco WAVE 594 WAN Optimisation Appliance

Saturday, 09 Feb 2019

Installing ESXi on a Cisco Wide Area Virtualisation Engine Appliance

Why would you want to do this? No real reason, but we've been decommissioning some hardware, and it's pretty clear that Cisco WAVE Appliances are just a Compute Server, with some stuff like VGA Ports removed. Originally these Appliances were designed for CDN-like WanOp purposes, so they have extras like Cavium Crypto/Offload Cards onboard, and some SATA storage; so I thought I'd have a go at loading VMware ESXi Hypervisor onto them.

The box I have is a Cisco WAVE 594, with specifications as follows:

  • Processor - Intel Xeon X3430 @ 2.4 GHz
  • Memory - 8 GB DDR3 RAM
  • Storage - 2x Hot-pluggable 500 GB SATA 7.2k Hard Drives
  • Storage - 1x Internal 4 GB USB Flash Disk
  • Network* - 2x Intel 82574L 1 GbE Network Ports

* = Not detected by ESXi, even though they're on the VMware Hardware Compatibility List (HCL)

What have we got here, Captain?

Here's a few photos of what we've got to work with:


Inside, you'll notice an internal USB port, plugged into a 4 GB USB Flash Drive (by some company I've never heard of); outside, you'll notice I've plugged in a USB 3 Ethernet Adapter (that uses the Realtek RTL8152 Chipset).

Port-wise, all we have to play with is:

  • 1x External USB Port
  • 1x micro-USB Console Port
  • 1x RJ45 Console Port (Serial Port)
  • 2x RJ45 1 Gbps Network Ports

What you don't have is a VGA Port, or spare USB Port to plug a Keyboard into (as well as a USB Flash Disk for the ESXi HV/OS Volume), which will make it pretty hard to process the Next/Next/F11 sequence required to install ESXi.

Time to ask a friend

I was a bit flummoxed at this point, but handily a friend pointed out that ESXi doesn't care about hardware changes after the fact - so I could stage all this by pre-installing ESXi onto the internal 4 GB USB Drive on a different machine. Which is exactly what I did; to do this, I:

  1. Created a VMware Workstation (I know, it's a work machine - I'm normally a VirtualBox man) Virtual Machine called "USB Test" on my Laptop
    1. Allocated it at least 2x vCPUs with 2x Cores
    2. Allocated it at least 4 GB RAM
  2. Followed this guide on How To USB Boot a VM in VMware Workstation 11
  3. Downloaded ESXi 6.5.0 ISO from VMware vSphere Hypervisor (ESXi) 6.5
  4. Inserted the 4 GB USB Drive
  5. Opened Rufus Bootable USB Maker
  6. Flashed VMware-VMvisor-Installer-6.5.0-4564106.x86_64.iso onto my 4 GB USB Drive
  7. Booted my "USB Test" VM, which boots the 4 GB USB Drive
  8. Followed the ESXi installation process and installed ESXi over the 4 GB USB Drive volume
  9. Rebooted the "USB Test" VM, and attached a "Host-only" Network Adapter to it
  10. Waited for ESXi to Boot, and receive a 192.168.85.x Host-only IP Address

Now I've got ESXi built onto the 4 GB USB, I need to tweak a few bits before I plug it into the Cisco WAVE 594. Using the Host-only NIC in VMware Workstation means I can locally navigate to https://192.168.85.x/ui/ on the same Laptop running VMware Workstation to jump onto ESXi vSphere and configure it ("Host-only" means it's a virtual network between just that VM and your Laptop's OS - Windows 7 for me - which sees it as a Virtual NIC).

Making it work without VGA

As well as any other ESXi settings - such as Hostname, vmk0 IP Address, Storage Volumes (although no point doing that until this is plugged into the Cisco WAVE 594 itself) - I'll need to tweak ESXi to output its boot screen (VMware call this the Direct Console User Interface, or DCUI; I call it the "yellow and black ESXi boot screen", much catchier) somewhere other than VGA, as the WAVE 594 doesn't have a VGA Port.

Doing this is quite easy; what ends up happening is that a VGA-like output (i.e. the VMware DCUI) gets redirected to the Serial port, which in this case is the trusty old blue RJ45 Console port. To do this, follow the instructions on VMware's website Redirect the Direct Console to a Serial Port Using the vSphere Client:

  1. Login to the vSphere HTML Client (i.e. https://192.168.85.x/ui/)

  2. Click the Configuration tab

  3. Click Host, then Advanced Settings

  4. Search for parameter VMkernel.Boot.logPort

    1. Make sure it says default

  5. Search for parameter VMkernel.Boot.gdbPort

    1. Make sure it says default

  6. Search for VMkernel.Boot.tty2Port
    1. Set it to com1
  7. Click OK

Job done, now we can simply insert the USB Drive into the internal USB slot, connect our trusty blue Console Cable and USB Adapter into the Console Port, and set PuTTY or Screen to 115200 Baud rate*, and boot the Cisco WAVE, then wait for the ESXi Boot Messages and DCUI to flow...


* = If you want to see the WAVE BIOS boot messages, you'll have to set it to 9600 baud first, and then change it to 115200 when you get garbage characters on your screen output.

So close, but yet so far

Remember that asterisk note I wrote before, where VMware lie and say they support the Intel 82574L in their HCL? Well, they don't - and to save you time, they:

  • Don't in ESXi 5.5
  • Don't in ESXi 6.0
  • Don't in ESXi 6.5
  • Don't even when you mess around with custom and obsolete net1000e VIB driver packs

Now what? There's not much use having an ESXi Node with no Physical Networking on it! This is where the second brainwave clicks in; let's use that USB Ethernet Adapter we've got lying around! Luckily Jose Gomes has had exactly the same idea and created a lovely guide on using a USB Ethernet driver for ESXi 6.5 - so follow that. For me, this looked like:

  1. Download the Driver VIB for the Realtek USB Adapter
  2. Enable SSH Service in ESXi vSphere Web UI (the Service is called "tsm-ssh")
  3. Use FileZilla to login as "root", and copy-paste the VIB to /tmp/
  4. Follow VMware KB Article 2147650 to disable the newer USB Drivers
  5. Install the custom Realtek VIB, from SSH this command should do it:
    1. esxcli software vib install -v /tmp/r8152-2.06.0-4_esxi65.vib
  6. Reboot ESXi

Let's see what we get this time then, when we also plug our cheapo USB 3 Ethernet Adapter in to the front USB port (and ESXi 4 GB USB into the internal USB port):


Great Success!

There is a caveat here - I find that, on reboots, ESXi DCUI will uncheck the "Use vmnic32 for Management" box, so it won't be contactable from the Network/won't get a DHCP IP until you manually press F2 -> Login to DCUI -> Re-enable it, which isn't much use if it's remote and the power goes.

Apparently there's a fix for that here in Install ESXi on a server/laptop with only USB Ethernet with an aptly-named file called "weasel", but I've had stoat-all success in getting it to work, so it's a limitation I've just lived with.

As a side note, because we didn't run the interactive installer on ESXi while it was connected to the WAVE 594 Hardware, you'll need to manually use the ESXi Datastore -> Storage -> Adapter -> Delete Partition option to wipe the partitions of data on both the 2x 500 GB SATA Disks, and can then set them both up as "New Datastores", so they can be used to hold VMs as VMDK virtual hard drive files.

Here's a handy guide on How To Erase ESXi Disks With ESXi Host Client v3.

Have fun!

Automation - The "Script it" versus "Do it" continuum

Sunday, 03 Feb 2019

The "Script it" versus "Do it" continuum

A recent tweet from @nickrusso4258 got me thinking about something I've been trying to express in my professional (don't laugh, people sometimes say I am) life for a while now, that can strike a nerve with the "Automate ALL THE THINGS!" crowd; scripting something (and by extension automating something), isn't always the right answer for an Organisation's use of Time (read: your 9-5 they pay for).

As I appreciate that not everyone is a Coder, DevOps or new-kid (some of us still get paid to be Cisco Mario; not everything is up in Toad Cloud yet...), this concept can apply a little wider than just to Developers, and even probably to the Business-y people all us IT Folk interact with on the daily. Using my finely-honed MS Paint skills (side-note: you've not lived until you've done a Network Diagram in MS Paint), here's a sexy graphical approximation of the theory:


Making stuff up #1 - Payback sweet spot

What the graph is trying to demonstrate is that the world of repetitive tasks can loosely be split into two partisan camps:

  1. "Script it"
    1. i.e. Put the additional effort (more than to just "bang another <repetitive task> out") in, and automate it/script it/somehow make it easier to perform than just doing the do over-and-over, with two tangible outputs:
      1. Completion of <the task>
      2. Automation of <the task>
  2. "Do it"
    1. i.e. Don't worry about the why, just repeat the manual steps you'd normally do and "bang another <repetitive task> out", with one tangible output:
      1. Completion of <the task>

The obvious sweet spot here is that, for a given number of repetitions of <the task> over time, the additional effort of "scripting it" (the time taken to do the automation, on top of that of just <doing the task>) will eventually pay itself back; after a given "Payback sweet spot", you've now got time back to do other stuff, which you'd otherwise have spent just doing <the task> again and again.
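To put a rough number on that "Payback sweet spot", here's a minimal sketch in Python. It assumes a fixed manual time per run, a one-off scripting cost, and an optional hands-on time per scripted run; the function name and the figures are made up for illustration, and aren't read off the graph.

```python
import math

def break_even_runs(script_hours, manual_hours_per_run, scripted_hours_per_run=0.0):
    """Return how many runs it takes for the script to pay for itself.

    Returns None if the script saves nothing per run, so it never pays back.
    """
    saved_per_run = manual_hours_per_run - scripted_hours_per_run
    if saved_per_run <= 0:
        return None
    # Smallest whole number of runs where total time saved >= up-front cost.
    return math.ceil(script_hours / saved_per_run)

# Hypothetical figures: 8 hours to write the script, 2 hours per manual run.
print(break_even_runs(8, 2))                               # 4
print(break_even_runs(8, 2, scripted_hours_per_run=0.5))   # 6
print(break_even_runs(8, 2, scripted_hours_per_run=2))     # None
```

Anything to the right of that run count is time earned back; anything to the left is time you haven't recovered yet.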

Alright, I'm buying it #2 - Positive opportunity cost

Or in other words, you're now in "Positive opportunity cost" - that is, <the task> is in some way automated, and you can dedicate your time to the other fifteen-million items on your "To Do" list, instead of this <task>. All is well in the world, you've automated all the things - and bar a little troubleshooting and debugging you unexpectedly have to do (i.e. when you discover your vGhetto VCB Cron Job uses a file that gets overwritten at ESXi System Reboot...), you're actually "earning time" saved through the script parallel-working the task for you.

Bully for you; your life is complete, you've moved all the things to teh Cloudz, and you're about to marry Princess Toadstool, and live in the Kingdom of the Mushroom Cloud forever mor...

Wait a minute, what's this #3 - Negative opportunity cost

But look over there on the left-hand side of this conceptual model; what's that pesky "Negative opportunity cost" all about then? I'm just about to pop the ring on Princess Toadstool, you saying I've got a problem here?

What I'm referring to here is the cold reality of Work; you're ultimately paid to produce output that a Customer wants - whether that's direct tangible stuff ("Hi, make this Network Switch go now please") or otherwise intangible stuff ("Hi, move these Apps to the Cloud? Mkay, you'll need to make a Project Plan for that, I get it...") - it's all output that's working towards a tangible goal.

You know what isn't output working towards a tangible goal? Scripting.

You know what you can't accurately do? Predict the future.

You know what the problem is here? Scripting a <task> that, in future, ends up being run so few times that writing the script took longer than just manually repeating the <task> would have taken you (and doing it manually gets you multiple lots of tangible output for that time).

Let's give you a worked example; suppose you need to write a script to output all the IP Helper Addresses of a Cisco IOS Script, and (you don't know it yet), but you're not great with Bash Shell script (well, you do know that...), and it'll end up taking you 16 hours. Sounds great; much easier than ripping through 500 devices and doing that manually, right - that'd take you maybe, I dunno, 5 hours and a bit of hand-cramp?

But what if I said to you that, unbeknown to you, we're about to swap out all that Cisco IOS kit for <SD-WAN Vendor XYZ> kit; where stuff like this (IP DHCP Relay Address) is pushed out in a programmatic, templated fashion anyway. What's happened now? Well, in Business Output terms, you've just wasted the time it would have taken to do it manually (5 hours) subtracted from the time it took you to script it and get the tangible output (16 hours), so you've cost the Business 11 hours of time you could have been doing something else productive.

Which would be 11 hours' worth of "Negative opportunity cost", and seems to be something the Automation Crowd rarely focus on; none of you are Mystic Meg; none of you have Crystal Balls; none of you can predict the future.
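The sums in that worked example can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes a scripted run costs no hands-on time once the script exists.

```python
def opportunity_cost_hours(script_hours, manual_hours_per_run, runs):
    """Hours gained (positive) or lost (negative) by scripting instead of doing.

    Assumes a scripted run needs no hands-on time once the script is written.
    """
    return manual_hours_per_run * runs - script_hours

# 16 hours to script, 5 hours by hand, and the task only ever runs once
# before the kit gets swapped out.
print(opportunity_cost_hours(16, 5, runs=1))   # -11

# Same script, but the task turns out to be needed four times.
print(opportunity_cost_hours(16, 5, runs=4))   # 4
```

The catch is the `runs` argument; it's the one number you can't know up front.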

Something to dwell on. Meanwhile, I hope Princess Toadstool likes Hula Hoop crisps as engagement rings...

Disappearing DNS entries when your CNAME TTL differs from your PaaS Provider's

Saturday, 26 Jan 2019

The dreaded "Page can't be displayed" error

Most people in the field of IT or Networking will have seen this lovely Internet Explorer error, and immediately recognised their day was about to change course away from the schedule:


The why can vary massively; for this blog post, we'll look at one case in point - what happens when your DNS Time to Live (TTL) record, on your CNAME, doesn't match-up with your Platform as a Service (PaaS) provider's A Name. But first, a bit of background here - names changed to protect the innocent.

The Scenario with the PaaS Provider

We've got a Web Application that we've decided to farm-out to a PaaS Provider, which used to be on-premises (or "on-prem" for you cool Cloud Kids). It's very important to the Business, but for the purpose of technology employed it's nothing special - think a HTTPS Website, where the PaaS provider does DNS-based "Elastic" (boing!) Load Balancing - also known as GSLB, but the new Cloudy World has to re-invent the terms we're already used to... *grumble* *grumble*

Let's throw in some made-up pseudonyms to anonymise this a bit, and add some context:

  • My Employer (Enterprise Business, or "the Business")
    • Name - MyCompany Ltd
    • Main External URL/Domain - mycompany.com
    • Main Internal URL/Domain - prod.mycompany.uk
  • PaaS Provider
    • Name - PaaS Co. Ltd
    • Main PaaS URL/Domain - paasco.com
    • Cloud Environment Name - PaaSCloud
    • Use Load Balancers from - BigAssLoadBalancers (Vendor)

Because the Business (rightly) thinks that a new PaaS URL of https://bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com might not be as easy-to-remember as the old on-prem (yes, I'm trying to bait you with that phrase) one of https://appname.prod.mycompany.uk; and because we've got no choice about the PaaS URL, we've taken the decision to make a new sub-domain of *.paascloud.mycompany.com. While we're there, we think we'll sort out the outmoded concept of Internal (prod.mycompany.uk) vs External (mycompany.com) URLs, because this is all hosted off-prem anyway; so it's technically no longer part of our "internal" Domain.

Regardless of PaaS Co, MyCompany uses Internal DNS that sits on Active Directory Domain Controllers; for the sake of ease, I'll call this "Internal DNS". MyCompany outsources its Internet DMZ Data Centres to another MSP; we'll call them MSPCo. MSPCo's only relevance here is that they run our External DNS/Domain (from Internet-facing ns1.mspco.com DNS Servers), whereas we run our Internal DNS/Domain AD-DC DNS Servers. Or, in short:

  • MyCompany
    • Run Internal DNS Servers (i.e. pdc1.mycompany.uk) that are authoritative (but not advertised to Internet) for *.mycompany.uk
  • MSPCo
    • Run External DNS Servers (i.e. ns1.mspco.com) that are authoritative for *.mycompany.com

To give us an easy-to-remember FQDN for the AppName Web Application, we've setup the following which means it will be https://appname.paascloud.mycompany.com:

  • Sub-Domain Space (for all Apps on PaaS Co)
    • *.paascloud.mycompany.com
  • Current PaaS Web App (one of the Apps on PaaS Co)
    • appname.paascloud.mycompany.com
  • Internal DNS (MyCompany, i.e. pdc1.mycompany.uk)
    • Authoritatively Resolve requests for *.prod.mycompany.uk
    • Conditional Forward requests for *.paascloud.mycompany.com to ns1.mspco.com
  • External DNS (MSPCo, i.e. ns1.mspco.com)
    • Authoritatively Resolve requests for *.paascloud.mycompany.com

The Problem with DNS Recursion

All that we've achieved above is a series of "forwarders", such that, for the worst case (Internal Client), they'll do this:

  1. Lookup appname.paascloud.mycompany.com against Internal AD-DC DNS (i.e. pdc1.mycompany.uk)
  2. Internal AD-DC DNS Conditional Forwards this to MSPCo External DNS (i.e. ns1.mspco.com)
  3. MSPCo External DNS (i.e. ns1.mspco.com) resolves this to a CNAME of bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
    1. MSPCo External DNS (i.e. ns1.mspco.com) then Recursively Resolves this against its upstream DNS Provider (let's say dns1.bigisp.com)...
    2. ...Which queries the Root DNS Servers (i.e. a.root-servers.net), which tell it to ask the PaaS Co Authoritative DNS Servers (i.e. ns1.paasco.com) for the A Name associated with bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com...
    3. ...Which comes back from PaaS Co DNS Servers (i.e. ns1.paasco.com) as Public IP Address 203.0.113.234 (not real, check out RFC 5737 - IPv4 Address Blocks Reserved for Documentation)
  4. Internal AD-DC DNS replies back to the Internal Client, for a request of appname.paascloud.mycompany.com, with:
    1. (The CNAME) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
    2. (The A Name) 203.0.113.234

Phew, there's a lot of steps eh? But at least we're out of the woods now, the client has the IPv4 Address it needs, so what's the "Page not Displayed" thing all about?

Pesky DNS TTLs

Here's the bit where the hierarchy of recursion in DNS starts to 1-up you, and the bad day kicks in - perhaps as known all-too-well by these graffiti artists:


Firstly, a caveat - all of the below may be different for your scenario, depending on how MSPCo DNS Recursion is/isn't setup.

If we make use of the lovely nslookup tool on Windows, here's what we can deduce for our good response (i.e. when the page actually displays, rather than the dreaded IE "Page not Displayed" error). Remember that pdc1.mycompany.uk is my Internal DNS Server (for this example anyway, in reality AD has a Parent/Child Regional Domain Controller hierarchy, so each Client uses a different AD-DC):

C:\Users\NervousAdmin>nslookup
> set debug
> server pdc1.mycompany.uk
<snip - goes off and resolves pdc1.mycompany.uk to IP 10.0.1.99>
> appname.paascloud.mycompany.com.
Server: pdc1.mycompany.uk
Address: 10.0.1.99

------------
Got answer:
 HEADER:
 opcode = QUERY, id = 24, rcode = NOERROR
 header flags: response, want recursion, recursion avail.
 questions = 1, answers = 2, authority records = 0, additional = 0

 QUESTIONS:
 appname.paascloud.mycompany.com, type = A, class = IN
 ANSWERS:
 -> appname.paascloud.mycompany.com
 canonical name = bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
 ttl = 7200 (2 hours)
 -> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com 
 internet address = 203.0.113.234
 ttl = 60 (1 min)
<snip>
------------
Name: bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
Address: 203.0.113.234
Aliases: appname.paascloud.mycompany.com

Given the response above is good (when everything is working), what does the above tell you? If we focus on the TTL sections, you'll see Windows has cached two responses here:

  1. appname.paascloud.mycompany.com -[CNAME]-> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com, cached for 7200 seconds (or 2 hours)
  2. bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com -[A Name]-> 203.0.113.234, cached for 60 seconds (1 min)
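If you'd rather not eyeball the debug output every time, the TTLs can be pulled out with a few lines of Python. This is a rough sketch that assumes the ANSWERS layout shown above ("->" owner lines followed by a "ttl = N" line); the function name is mine, and other nslookup versions may format things differently.

```python
import re

# The ANSWERS section from the nslookup debug output above.
SAMPLE = """\
 -> appname.paascloud.mycompany.com
 canonical name = bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
 ttl = 7200 (2 hours)
 -> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
 internet address = 203.0.113.234
 ttl = 60 (1 min)
"""

def answer_ttls(debug_text):
    """Return [(owner_name, ttl_seconds), ...] in the order they appear."""
    results = []
    owner = None
    for line in debug_text.splitlines():
        line = line.strip()
        if line.startswith("->"):
            owner = line[2:].strip()
            continue
        match = re.match(r"ttl = (\d+)", line)
        if match and owner is not None:
            results.append((owner, int(match.group(1))))
            owner = None
    return results

ttls = answer_ttls(SAMPLE)
for name, ttl in ttls:
    print(ttl, name)

# Flag the combination this post is about: CNAME outliving its target.
if len(ttls) == 2 and ttls[0][1] > ttls[1][1]:
    print("Warning: CNAME TTL is longer than the TTL of the record it points at")
```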

So what happens in 60 seconds, when that A Name expires then? Let's find out - the ">" shows you are within nslookup, so just hit the Up key, and Enter to re-lookup "appname.paascloud.mycompany.com." (as per prior posts, the appended dot means "just this exact FQDN, and no additional DNS Suffixes"), eventually you'll notice the bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com section goes to a TTL of 0:

> appname.paascloud.mycompany.com.
Server:  pdc1.mycompany.uk
Address:  10.0.1.99
<snip - only interested in the CNAME ttl section>
ANSWERS:
<snip>
-> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com 
 internet address = 203.0.113.234
 ttl = 0

But you'll notice your browser access to https://appname.paascloud.mycompany.com works fine during these tests; until you do the nslookup again, after the "ttl = 0" response. Now, there be dragons.

Uh-oh, where's my response gone?

When you refresh again, your heart will drop, your bum will tighten, your browser access to https://appname.paascloud.mycompany.com will stop working, and you'll see this:

C:\Users\NervousAdmin>nslookup
> set debug
> server pdc1.mycompany.uk
<snip - goes off and resolves pdc1.mycompany.uk to IP 10.0.1.99>
> appname.paascloud.mycompany.com.
Server:  pdc1.mycompany.uk
Address:  10.0.1.99

------------
Got answer:
    HEADER:
        opcode = QUERY, id = 28, rcode = NOERROR
        header flags:  response, want recursion, recursion avail.
        questions = 1,  answers = 1,  authority records = 0,  additional = 0

    QUESTIONS:
        appname.paascloud.mycompany.com, type = A, class = IN
    ANSWERS:
    ->  appname.paascloud.mycompany.com
        canonical name = bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
        ttl = 6926 (1 hour 55 mins 26 secs)

<snip>
------------
Name:    appname.paascloud.mycompany.com

Which will give you your dreaded "Page not Displayed" friend, for exactly another 1 hour, 55 minutes and 26 seconds.

And how do I know that? Because that's what the TTL says that CNAME entry will stay in your cache for - regardless of the fact your Windows Client hasn't had a recursive response of the actual IP Address that it ultimately resolves to (203.0.113.234).

So what's the fix? Firstly, let's touch on DNS TTL. This isn't much different to IPv4 TTL; it just means that, once the TTL hits 0, the entry will be purged from your local DNS Cache. What happens next is the crucial part, dictated by the "DNS Response Hierarchy" your response had; if it's just a straight single-level hierarchy (i.e. domain.com -> 203.0.113.1), then your Client will go off and re-send the DNS Request to resolve domain.com to an IP Address.

But our case is different, and not in a good way - our "DNS Response Hierarchy" looks like this:

  1. (Parent) Fetch appname.paascloud.mycompany.com
    1. (Child) If you got here, now fetch bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com

But our TTLs look like this:

  1. (Parent) appname.paascloud.mycompany.com = TTL <bigger than "Child">
    1. (Child) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com = TTL <smaller than "Parent">

That's not what we want at all; given these are two differing DNS Administrative Domains (owned and operated by two differing Companies - MSPCo for appname.paascloud.mycompany.com and PaaS Co for bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com), we (MyCompany) don't have any direct control over these. Regardless though, we need them to flip-it-around so that this happens:

  1. (Parent) appname.paascloud.mycompany.com = TTL <smaller (or same) than "Child">
    1. (Child) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com = TTL <bigger than "Parent">

This way, the "Parent" (initial, or root, or "actual FQDN I wanted the IP for") TTL always expires before (or with) the "Child" (the A Name behind the CNAME), so the Client never holds a cached CNAME that points at nothing; instead the whole DNS Lookup process will re-occur, and we'll happily get an IPv4 Address back. Technically simple, but you try and explain that to MSPCo and PaaS Co, and you'll find your "shouty voice TTL" quickly gets towards that precious 0...
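To see why the order of the two TTLs matters, here's a toy Python model of the behaviour described in this post. It's a sketch under a big assumption: the cache stores each record separately and, on a hit for the first name, answers only from what's still cached rather than re-chasing the CNAME. That mirrors what I observed, not how any particular DNS server is documented to work; the class and function names are made up.

```python
class ToyCache:
    """Caches each record on its own TTL and never re-chases a cached CNAME."""

    def __init__(self, upstream):
        self.upstream = upstream   # name -> (type, value, ttl_seconds)
        self.cache = {}            # name -> (type, value, expires_at)

    def _fresh(self, name, now):
        entry = self.cache.get(name)
        if entry is not None and entry[2] > now:
            return entry
        self.cache.pop(name, None)   # expired or never cached
        return None

    def lookup(self, name, now):
        """Return the address for name at time now (in seconds), or None."""
        entry = self._fresh(name, now)
        if entry is None:
            # Full miss: fetch the whole chain upstream, caching as we go.
            for _ in range(8):
                if name not in self.upstream:
                    return None
                rtype, value, ttl = self.upstream[name]
                self.cache[name] = (rtype, value, now + ttl)
                if rtype == "A":
                    return value
                name = value
            return None
        # Hit on the first name: answer only from whatever is still cached.
        while entry is not None and entry[0] == "CNAME":
            entry = self._fresh(entry[1], now)
        return entry[1] if entry is not None else None


APP = "appname.paascloud.mycompany.com"
LB = "bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com"

def timeline(cname_ttl, a_ttl, times=(0, 30, 61, 120)):
    """Look APP up at each time and return the list of answers."""
    cache = ToyCache({APP: ("CNAME", LB, cname_ttl),
                      LB: ("A", "203.0.113.234", a_ttl)})
    return [cache.lookup(APP, t) for t in times]

print(timeline(7200, 60))   # address, address, None, None
print(timeline(60, 300))    # address every time
```

With the "Parent" TTL bigger than the "Child", the answer vanishes after a minute and stays gone; flip them round and every lookup comes back with the address.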

Remotely changing the Management SVI on a Cisco 3524XL

Friday, 25 Jan 2019

A Cisco 35-what-what now?

You probably haven't heard of a Cisco 3524XL. You're possibly sat reading this thinking: "I've heard of the Nexus 3K, sure, but WTF is a 3520-Series, am I behind already?". The answer is no, you aren't (or yes, you are if you're unfortunate enough to know what a C3524XL is) - but don't take my word for it, let's ask what Danny Dyer thinks:


Why are you blogging about a Cisco Switch that went EoL over a decade ago?

Indeed, the Cisco Catalyst 3524XL went End of Life in 2002 - far before I even started working in the field of Networking. So why am I talking about it here? Well, a few reasons:

  1. @DarrenFullwel challenged me to on Twitter
  2. It's got lessons to teach us all
  3. History needs to remind us that banging on the suffix "XL" should only be confined to fast food and t-shirts

Let's focus on what it can teach us - first, a little primer on my chief bugbear with it as a "capable Layer 3 Campus Access Switch".

The C3524XL only supports one SVI

That's not too bad you might think; you probably only want to give it a Management IP Address to the SVI, and let something more capable handle inter-VLAN Routing. But what happens when you want to do something like this:

  1. Remotely re-IP Address the Management IP (and the boss won't let you hire a van and take the day to drive to the arse-end of nowhere)
  2. Remotely change the configuration your colleague left with it using VLAN1 as the SVI, but everywhere else uses VLAN55 for Switch Management (and the boss still won't let you hire that van)

Any ideas on how you're going to sort that out, remotely? Let me introduce you to the age-old Network Engineering practice of...

Squeaky bum time


There's nothing for it, soldier; we've got two basic choices to do this remotely, and we're gonna need a stock of toilet roll for both:

  1. Use a SNMP-based config upload tool like Network Billy (coincidentally the finest thing to have come out of a GeoCities website)
  2. Use a TFTP-based config upload tool (like TFTPd32)
  3. Keep hassling the boss for that van

I went for option two, TFTP-based; but the basic concepts are the same. Firstly, we're going to double-check what we want to achieve; for my scenario, that's two things:

  1. Disable VLAN1
  2. Migrate the Management IP to VLAN55 (172.31.0.0/24)
    1. I'll also have to change this upstream, so that my L3 Default Gateway Switch/Router moves 172.31.0.0/24 from VLAN1 to VLAN55, or have both co-exist for a while and VRF Lite one VLAN off from the other; but that's for another blog post

To do this interactively, I'd want to do something like the following:

conf t
int vlan1
 no ip address
 no desc
 shut
vlan 55
 name Mgmt_VLAN
int vlan55
 desc Management VLAN
 ip address 172.31.0.99 255.255.255.0
 no shutdown
ip default-gateway 172.31.0.1
end
wr mem

But we don't have that luxury, so we'll go for a three-step approach.

Step 1 - The interactive bit

We need to setup the VLAN (just at Layer 2) ready to go; as we're talking about an archaic C3524XL, depending on the age of IOS on the Switch, that's either going to be the "new Cisco way" (as above), or if you're as unlucky as Dyer thinks, the old VLAN Database method, like this:

C3524XL#vlan database
vlan 55
exit

Regardless of which, we'll then check we've got the VLAN ready to go, and if necessary, add it to any 802.1q Trunk interfaces up to the Core (L3) Switch:

C3524XL#sh vlan id 55
C3524XL#sh int trunk | inc Span|Port|55

Now onward to the offline part.

Step 2 - The offline bit

Firstly, we need to grab the config file off the C3524XL. If you've got TFTPd32 running on your PC (which needs to be accessible from the existing C3524XL VLAN1 SVI IP Address, say your PC is 10.0.0.99), this is just a matter of turning TFTPd32 on, configuring it to a directory and ensuring Winblows Firewall isn't blocking inbound TFTP (UDP/69). Then login to your C3524XL, and do something like this to copy the config from the Switch to your PC:

C3524XL#copy run tftp://10.0.0.99/c3524xl-confg
yes

Now you have the file locally, we'll be editing it in a text editor to make the changes above, and turn it into the startup-config, saved as c3524xl-startup.txt (the filename used in the upload step later). For the sake of space, I'm only showing the changed lines; the rest of the config needs to be there, you are only Find-Replacing these sections:

<snip - rest of config removed, but would be there>
hostname C3524XL
<snip - rest of config removed, but would be there>
int vlan1
 no ip address
 no desc
 shut
int vlan55
 desc Management VLAN
 ip address 172.31.0.99 255.255.255.0
 no shutdown
<snip - rest of config removed, but would be there>
ip default-gateway 172.31.0.1
<snip - rest of config removed, but would be there>

A few handy hints here:

  • Make sure all your interconnect, Trunks and Management SVI VLAN55 are set to "no shutdown"
  • Triple-check that in your scenario it is actually VLAN 55 for Management; the IP Address is correct and doesn't conflict & VLAN55 exists and would be allowed on the Trunk

Nothing left now but to execute our actions and make rocket go now!

Step 3 - The bit you make a calming brew beforehand for

Now it's crunch time. You've obviously got an RFC Change Request that's approved to do this (because you wouldn't "Lab on Live", would you?), so what's to fear, eh?

Firstly, we upload the amended config file, straight into startup-config:

C3524XL#copy tftp://10.0.0.99/c3524xl-startup.txt startup-config

Then we get paranoid and double-check it copied everything correctly, that we're definitely Trunking that VLAN55 and we've set the Management VLAN 55 to "no shut":

C3524XL#sh start
C3524XL#sh vlan id 55
C3524XL#sh int trunk | inc Span|Port|55

And finally we sup-up that brew, clench the derriere, and invoke the outage-causing Management IP switchover:

C3524XL#reload
yes

Then we wait, and nervously set our local PC Command Prompt "ping -t" going, waiting for it to pop back up with the new Management IP address:

C:\Users\NervousAdmin>ping -t 172.31.0.99

Pinging 172.31.0.99 with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
<2-3 nervous minutes later>
Reply from 172.31.0.99: bytes=32 time=13ms TTL=64
Reply from 172.31.0.99: bytes=32 time=13ms TTL=64
[CTRL+C]
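If staring at "Request timed out" isn't good for your nerves, the waiting can be scripted too. Here's a rough Python sketch of a retry loop; the probe shown tries a TCP connection to Telnet (TCP/23), which is purely my assumption about how the Switch is managed, and the function names, attempt count and delay are made up.

```python
import socket
import time

def tcp_probe(host, port=23, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def wait_until_up(probe, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() until it returns True.

    Returns the attempt number that succeeded, or None if it never did.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        sleep(delay)
    return None

# Usage (not run here, as it needs the real Switch on the network):
# wait_until_up(lambda: tcp_probe("172.31.0.99"))
```

Passing the probe in as a function means you can swap Telnet for whatever your Switch actually answers on.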

Wrapping it up

And there we go; remotely changing the Management VLAN and IP Address of a Switch that's older than time - and hopefully a useful tip if you have a similar single-SVI-only piece of sh... kit. Enjoy!

Using Intel vPro AMT ME as a poor man's iLO for KVM

Monday, 21 Jan 2019

Got Intel vPro AMT ME, bruv?

Recently I've been trying and failing to get Nutanix Community Edition (CE) to cluster-up, with one ESXi-nested virtualised AHV/CVM and another physical AHV/CVM, running on an old HP Elite 8200 Small Form Factor Desktop PC. If you've played around with Nutanix, you'll know there's a lot of tinkering with the Host (Acropolis Hypervisor, AHV) Node to install the Controller Virtual Machine (CVM), and a bit of rebootery required; if you've been following this blog long, you'll realise that I'm not favoured with the Technology Gods - and my mileage often varies into many more reboots than the average bear.

When you're working with a frankenmachine (ProTip - Buy a 13-pin male Mini-SATA to 22-pin female SATA Converter to use the proprietary MicroSATA/Power Cable going into the CD Drive for an SSD), which you've put in your upstairs LAN Room, then the frequent trips up and down, and lugging a keyboard, video and mouse can get, well, annoying. Unless, that is, you've got Intel vPro, Active Management Technology (AMT) or Management Engine (ME) onboard your lovely business-class Laptop or PC - and then you can use Intel's AMT VNC Server.

BIOS Time - Setting it up

Note - Most of the first part of this is the same as the How-to Geek article on How to Remotely Control Your PC with some added time-saving, hair-tearing-out tips to follow later.

As with all good things in life (with PC hardware), the fun stuff happens in the BIOS. As per the links above, this is fairly simple:

  1. Take your old school keyboard, video and mouse (or USB Crash Cart KVM Adapter, if Christmas time has just been) and plug them into your vPro/AMT/ME-enabled Desktop or Laptop (well, not Laptop, obviously because it's got a keyb... never mind)
  2. Reboot
  3. Furiously tap Ctrl + P to get into the Intel ME Settings BIOS
  4. When asked for a password, unless you set it, it will be "admin" (without the speech marks)
  5. Enter "ME General Settings", and
    1. Change the password to something more secure (it'll need at least one capital letter, one number and one special character)
    2. Set up the Network IP for AMT - think of this the same as an iLO/iDRAC/BMC; you can either "Share" the Host OS's one (but why, as you're tied into that), or set a separate, dedicated IP for just AMT Keyboard Video Mouse (KVM) access
    3. Hit Enter and OK on "Activate Network Access" (or this was all for nought)
    4. Configure the DNS-related Hostname, DNS Server and related settings (maybe something like amt-<PC_Hostname>, so you can distinguish the two in your DNS later on)
  6. Enter "AMT Configuration", and
    1. Enable the "Manageability Feature Selection"
    2. Enable "SOL" (Serial-over-LAN)
    3. Enable "IDER" (IDE Redirection, i.e. ISO/Image Remote Booting)
    4. Enable "Legacy Redirection Mode" (by Legacy they mean "Using something sensible like VNC Viewer, rather than crappy Intel-proprietary KVM Viewers")
    5. Enable "KVM Feature Selection"
    6. Disable "User Opt-in"
      1. If you leave it enabled, the non-existent person in front of the real keyboard/video/mouse that you plugged in will have to type a challenge/response string to allow you in, which defeats the point
    7. Enable "Opt-in configurable from Remote IT"
      1. For when you sit back at your desk, and realise you didn't do the step above
    8. Escape/Escape/Escape/Yes/Save/OK

Now we've set up most of it, what can we do?

Stage 1 - The ME Web GUI

Now you've done all that BIOS work, here comes the first payoff - a lovely Web User Interface you can access via http://<AMT_IP_ADDRESS>:16992 (my AMT IP is 10.0.0.12, so for me that's http://10.0.0.12:16992).

The kind of information you get to see here includes:

  • System Information
    • Model, BIOS, Firmware etc.
  • Memory Information
    • Type, Number of DIMMs, Size etc.
  • Disk Information
    • Type, Size, Manufacturer etc.
  • Event Logs
    • Last Power, Last Crash, Case Opened etc.

Then there's the juicy ones that you literally don't want (or have) to leave your chair for any more:

  • Remote Power On/Off/Reboot
    • Including "Next Boot" actions (i.e. Boot to USB, Boot to BIOS etc)

Stage 2 - But Ma, where's my KVM?

If you've read this far, you're probably thinking you've been short-changed here; I promised you a KVM and I've delivered you a fancy Web GUI. So here's the fun part; you'll need one of the following to actually enable the VNC-based KVM functionality:

  1. (Windows App) MeshCommander
  2. (Windows App) Intel Manageability Commander
  3. (Windows SDK) Download the Intel SDK, extract it some place and execute "KVMControlApplication.exe" (hiding away under the "Windows", and then "bin" directories). ProTip - you'll need to install Microsoft dotNET for this, so get a brew break ready. You can then "Edit Machine Settings", login with "admin" and the <AMT_PASSWORD> you set earlier, and click "Machine Settings", then "Enabled - all ports" - as described in this lovely blog post

Regardless of which you choose, here's a big tip - the "RFB Password" has to be exactly 8 characters, and include at least one each of the following:

  • A capital letter
  • A number
  • A special character (e.g. @, ', | etc.)

That tip right there saved you two hours of Googling "Error 400" and "XML invalid", and - my personal favourite - "KVM no respond" errors.
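If you'd rather sanity-check a candidate RFB Password before feeding it to AMT, here's a minimal Python sketch of the three rules above. The function name is my own invention, and treating "special character" as "any ASCII punctuation character" is an assumption on my part; AMT itself may be pickier (or more lenient) than this.

```python
import string


def rfb_password_problems(password):
    """Return a list of reasons the candidate breaks the rules listed in this post."""
    problems = []
    if len(password) != 8:
        problems.append("must be exactly 8 characters")
    if not any(c.isupper() for c in password):
        problems.append("needs a capital letter")
    if not any(c.isdigit() for c in password):
        problems.append("needs a number")
    # Assumption: "special character" means ASCII punctuation (@, ', | and friends)
    if not any(c in string.punctuation for c in password):
        problems.append("needs a special character")
    return problems
```

An empty list means the candidate passes all of the checks described here; anything else tells you which rule tripped it up before AMT gets the chance to throw an unhelpful "Error 400" at you.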

You can also do this from within MeshCommander, which will prompt you to choose the KVM "Enabled - all ports" setting and to set the "RFB Password" (Intel-speak for "VNC Login Password").

Stage 3 - Look Ma, no hands(-eyes engineer lugging his ass upstairs)!

Once done, you can now use a standard VNC Client* to connect to <AMT_IP_ADDRESS>:5900, the same way you would with any other standard VNC Server.

* = On Windows, only RealVNC seemed to work. On Mac OS X, only VNC Viewer seemed to work. On Linux (Debian), only Remmina seemed to work.

You'll then be prompted for the VNC Password (this is the pesky 8-character RFB Password).

And finally you're given a lovely KVM VNC session into your vPro-enabled PC or Laptop.

Et voila - the poor man's iDRAC/iLO/CIMC/<BMC acronym of choice here> is complete!

Note, if you have a Windows PC and don't want to enable the VNC (TCP/5900) part, then both MeshCommander and Intel Manageability Commander have a built-in, non-VNC KVM Client, which seems to speak some magical SOL/IDER "backdoor" protocol into the AMT chip, so they always work, regardless of you turning on/off the "Legacy ports" settings.
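If a VNC Client refuses to connect, it's worth checking which AMT ports actually answer before blaming the RFB Password. Here's a rough Python sketch that tries a plain TCP connection to the two ports used in this post (16992 for the Web GUI, 5900 for VNC). The function names are mine, and a successful connection only proves something is listening, not that AMT will let you log in.

```python
import socket

AMT_WEB_PORT = 16992  # ME Web GUI port, as used in Stage 1
AMT_VNC_PORT = 5900   # VNC-based KVM port, as used in Stage 3


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


def check_amt_ports(host):
    """Map each AMT port from this post to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port) for port in (AMT_WEB_PORT, AMT_VNC_PORT)}
```

For my box that would be check_amt_ports("10.0.0.12"). If 16992 answers but 5900 doesn't, the "Enabled - all ports" KVM setting from Stage 2 is the first thing I'd re-check.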

When BGP AS-Override goes the wrong way

Sunday, 13 Jan 2019

BGP AS-Override

Much like my post on when BGP SoO goes the wrong way, I seem to have a problem with directionality of commands on Cisco IOS - this time, with BGP AS-Override. I came across this in an Enterprise Network (the same kind where we say "MPLS" but actually mean "IP VPN we buy from someone else"), where the ISP we used had an offering they called "Shared Access" - which basically means they'll let you hook an Access Circuit into someone else's IP VPN/VRF with them, as long as you, the ISP and the "VRF Owning Company" co-sign an agreement saying it's allowed.

Why might you want to do this? Think along the lines of Extranets, and furthering the idea that "Everything is just a Line Card" across Company boundaries; particularly useful if you work in the Large Enterprise and Public Sector space, as here there are often strange agreements where multiple Managed Service Providers (MSPs), Systems Integrators (SIs) and sometimes even Service Providers (SPs) (reluctantly) come together to offer a common "Service" back to either the General Public, or perhaps some large Industry Sector. Regardless of the why, the problem is normally the same old BGP-over-VRF limitation - if you use the same ISP for multiple IP VPNs/VRFs, and have end-to-end BGP reachability, BGP doesn't know to turn off its split-horizon-based-on-ASN functionality; because it just sees the same ASN twice in the AS_PATH, rather than "knowing" that the AS_PATH consists of two differing VRFs/Routing Domains.
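To make that "split-horizon-based-on-ASN" behaviour concrete, here's a toy Python sketch of the eBGP loop-prevention check: a Router refuses any Prefix whose AS_PATH already contains its own ASN. The function name is hypothetical, and the ASNs are simply the ones used in the scenario later in this post (1234 being the SP); real BGP implementations have knobs that relax this check, which I'm ignoring here.

```python
def accepts_route(local_asn, as_path):
    """eBGP loop prevention: reject any AS_PATH that already contains our own ASN."""
    return local_asn not in as_path


SP_ASN = 1234

# An AS_PATH that has already crossed the SP once (via the other IP VPN/VRF)
path_via_other_vpn = [65432, 65439, 65007, 1234, 64999]

# The SP's PE sees its own ASN in the path and drops the Prefix,
# with no idea the earlier 1234 belonged to a different VRF
sp_accepts = accepts_route(SP_ASN, path_via_other_vpn)  # False

# A Router in an ASN that isn't in the path would happily take it
ce_accepts = accepts_route(65430, path_via_other_vpn)  # True
```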

The Scenario Topology

The Scenario Network Topology shows:

  • 2x My Company Network MPLS Customer Edge (CE) Routers
  • 4x MPLS SP Network Provider Edge (PE) Routers
    • 2x Connected to My Company Network IP VPN/VRF VPNN123456
    • 2x Connected to Other Company Network IP VPN/VRF VPNN654321
  • 2x eBGP Peering from My Company Network <-> SP MPLS PE Router, connected to My Company IP VPN/VRF VPNN123456
  • 2x eBGP Peering from Other Company Network <-> SP MPLS PE Router, connected to Other Company IP VPN/VRF VPNN654321
    • 1x "Foreign Network" CE Router @ My Company Data Centre
  • AS-Override applied on My Company Network MPLS Customer Edge (CE) Router (towards My Company IP VPN/VRF VPNN123456)
    • Note that I am "piggy-in-the-middle"

Some notes on SP Terminology

As some of this is specific to using a Third Party SP's MPLS Network, through a "wires-only" IP VPN offering - here's a quick primer on some terminology I'm using, as this will differ between SPs:

  • "wires-only" - Means the SP drops an NTE/NTU in My Company's Premises, to which I attach my self-managed CE Router
    • The SP does not manage any of the CE Router; I eBGP Peer direct from a Private ASN to the SP's Public ASN (or whatever they use)
    • I'm told this model is more popular in the USA than Europe (but I'm in the UK, so there are exceptions to the rule...)
  • VPNNxxxxxx - The SP-allocated IP VPN/VRF Identifier, so that they can differentiate between their various Customers (they could name their VRF instances by Company Name, but what happens when the Company changes name, or two different Companies have the same/similar names...)
  • ASN Numbers - Those on the left-hand side are My Network ones; those on the right-hand side are "Foreign" (Other Company Network) ones
    • Just like between IPsec Encryption Domains, it's a good idea to make sure these don't conflict (tricky when everyone is using the same Private BGP ASN Range)
    • It is the same Core ASN/PE-CE Peering ASN that the SP uses for all Customers
  • CE Devices - I am the Customer (or one of two), and not the SP here; I have no visibility or access to any of the PEs in this topology
    • This is a very different slant to most write-ups and blog posts I've read on the matter; everyone seems to work for an SP bar me!
  • AS-Override - This is applied at My Company end only; the "Foreign" Company are not performing AS-Override
    • So the AS_PATH they "advertise" to me contains the raw SP ASN for their own CE-PE Peerings (Their CE1 <-> PE2 and Their CE15 <-> PE66)

What I thought would happen

Caveat - apparently, Cisco IOS doesn't let you use AS-Override in the Global Routing Table (GRT, y'know, the one that's not in an "address-family" command); but it sometimes does (worked on my ASR1Ks), and that's not the point of this post.

Focussing on My Company Data Centre - and ignoring the "Southbound" eBGP Peering from this DC into MPLS IP VPN/VRF VPNN123456 - here's an example of the Prefix I'm looking at, received from "Foreign" Company:

CE1#172.31.0.0/24 via <DC-Router1>, AS_PATH: 65007 1234 64999

Now, if we look at the "Southbound" eBGP Peering towards My Company IP VPN/VRF VPNN123456, I want to re-advertise "Foreign" Company Prefix 172.31.0.0/24 onward, via VPNN123456, into My Company Other Campus DEF (bottom-right). Given the "as-override" command is applied towards the SP's PE Router, I expected the "find-and-replace" operation to work in a similar (outbound) manner. That is, for this configuration on my CE1 Router @ My Company Network, Data Centre ABC:

CE1#
router bgp 65432
 neighbor 192.168.0.1 remote-as 1234
 neighbor 192.168.0.1 as-override

I thought my CE1 Router would therefore rewrite its own AS65432 (Local ASN, CE1 Router) with the SP's AS1234 (Foreign ASN, CE1 Router perspective) - so an AS_PATH that actually looks like this, to the downstream PE1 (and any other Routers) on VPNN123456:

PE1(VRF "VPNN123456") or CE99#172.31.0.0/24 via 192.168.0.2, AS_PATH: 65432 65439 65007 1234 64999

 ...but that's not how AS-Override works here.

What actually happens

It transpires the "find-and-replace" behaviour isn't working with the "find" parameter I think it is. If I use some colouring here, this will be easier to see. If we show the entire AS_PATH (including the Routers at either end, which you normally wouldn't see in BGP outputs), here's what you've got for Prefix 172.31.0.0/24 going all the way to CE1 @ My Company Data Centre ABC:

  • 64999 1234 65007 65439 65432 1234 65430

I appreciate this runs inverse/reverse to the AS_PATH that CE1 actually sees; but bear with my incorrect directional thinking here. So the part I'm focusing in on is between CE1 <-> PE1, or this part:

  • ...65432 1234...

At this point, in my head, I'm thinking "The neighbour command is applied outbound to the 192.168.0.1 SP PE1 peering, so it must use this relationship in the find-replace activity", so I'm thinking, after the AS-Override rewrite, it looks like this:

  • 64999 1234 65007 65439 65432 65432 65430

Here's the kicker

The reality is that AS-Override doesn't care about eBGP Peering relationships; it acts as a dumb "find-replace" algorithm, but it uses the eBGP Peering configuration to get its "find" parameter, by looking at the ASN value after the "remote-as" command, so here for CE1:

  • router bgp 65432
     neighbor 192.168.0.1 remote-as 1234

What it then "dumbly" does is look at the entire AS_PATH it already has, and simply replace the <REMOTE_AS> value with its <LOCAL_AS>, before "advertising" this out, so for CE1 it would do this instead:

  • 64999 65432 65007 65439 65432 65432 65430

Which completely broke my thinking, as I hadn't appreciated that a downstream Router could overwrite an AS_PATH entry that happened much earlier-on in the formation of the AS_PATH (i.e. for a Peering Association it wasn't involved in, so how could it dare overwrite anything to do with that?).
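If it helps to see the "dumb find-replace" spelled out, here's a small Python sketch of the behaviour as I've described it in this post. It is a model of my understanding rather than anything lifted from Cisco IOS, the function name is made up, and the AS_PATH is written in the same (backwards) source-to-destination order I used above.

```python
def as_override(as_path, local_asn, remote_asn):
    """Swap every occurrence of the neighbour's ASN for our own, wherever it appears."""
    return [local_asn if asn == remote_asn else asn for asn in as_path]


# End-to-end path for 172.31.0.0/24, in the order used in this post
full_path = [64999, 1234, 65007, 65439, 65432, 1234, 65430]

# CE1 is "router bgp 65432" with "neighbor 192.168.0.1 remote-as 1234"
rewritten = as_override(full_path, local_asn=65432, remote_asn=1234)
# rewritten == [64999, 65432, 65007, 65439, 65432, 65432, 65430]
```

Note that both instances of 1234 get swapped, including the one from the "Foreign" Company's CE-PE Peering that CE1 had nothing to do with, which is exactly the part that caught me out.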

So what next

For the example given, we actually ended up moving all this entirely, such that we had a PE-like Router where we could control ingress/egress into both IP VPNs (and AS-Override in both directions, between both IP VPNs) - but this isn't always possible. Technologically, it's easy to look dismissively at the Scenario Topology; but if you step back a bit, you appreciate our hand was forced. As I described earlier, this is a politically complex setup, with various MSPs and SIs - and as you can see, although CE15 sits in "our" DC (actually an MSP, but anyway...), it's actually a CE Router of our "Foreign" (think Extranet) Company's IP VPN (VPNN654321); which they just so happen to have with the same SP that we have Our Company IP VPN (VPNN123456) with.

Sure, this isn't a great place to be - but (in that time-honoured phrase), "It is, what it is"; looking longingly at CCNP and CCIE Greenfield Exam Topologies isn't making this self-rectify. We were fortunate because we had the capability to entirely redesign this (something for another blog post), but if we hadn't, there's all manner of constraints here causing pain, such as:

  • SP won't let us reconfigure their PEs on either IP VPN/VRF (so no quick-win "Bang AS-Override on PE66 and PE1" for you)
  • Commercials mean we can't collapse-out the CE15 <-> PE66 arrangement
  • CE1 / Data Centre ABC doesn't just exist for this flow (so no quick-win "Bang the VPNN123456 eBGP Peering into a VRF Lite instance, instead of the GRT")

What's the point then?

Ignoring the goal of getting this working, this was a useful real-world exercise, as it taught me:

  1. BGP AS-Override is dumb, and will quite happily assume the <REMOTE_AS> to <LOCAL_AS> Peering is the only one that contains the <REMOTE_AS>, which couldn't possibly already be in the AS_PATH
  2. BGP is not VRF-aware; its rules of split-horizon are there to annoy me and rob me of sleep
  3. Stop reading "neighbor" commands and assuming they imply the directionality of the thing they are doing
  4. Googling for issues like this throws up limited results, because everyone else seems to be able to access the SP PE Routers
  5. I need to flip-round the way I think about AS_PATH as "Destination-to-Source" rather than "Source-to-Destination"