pwd.re

A Journey Into BGP-ORR


Ever since I first heard of BGP-ORR (BGP Optimal Route Reflection, RFC 9107) some years ago I’ve been nerdily excited about it. Recently at my job I looked at the possibility of moving from in-path BGP route reflectors (which are in the forwarding path) to out-of-path route reflectors (which are not in the forwarding path, allowing for a BGP-free core) and figured BGP-ORR would be a good way of accomplishing a more centralized route reflector setup.

While there are a lot of good resources available about what BGP-ORR does and how it works I still had some questions. So, I brought up a small lab environment using Cisco CML and some Cisco XRv devices. I know Arista EOS also has support for BGP-ORR and I assume JunOS also has it, and as I write this there’s an open issue for FRRouting.

This is the layout of the lab network, not conceptually far away from how parts of the network I run at my job are built:

Eight routers in a network setup. p1, p2, cr1 and cr2 are in a ring. pe1 and pe2 are connected directly to cr1 and cr2 respectively, while pe3 is only connected to pe2. rr is connected to p2.

All routers are running IS-IS as the IGP, with Segment Routing (not SRv6) configured.

I’m not going to show the full configuration from each and every router (you can find that here), but instead just show a small and relevant subset to get the network running. This is all before BGP-ORR has been configured.

Interfaces

Nothing fancy, just a Loopback interface and unnumbered links between all routers so I don’t have to care about linknets. CDP for making sure I don’t mess up what ports go where :)

interface Loopback0
 ipv4 address 10.13.37.x 255.255.255.255
!
interface GigabitEthernet0/0/0/0
 description Link-xxxx
 cdp
 ipv4 point-to-point
 ipv4 unnumbered Loopback0
 no shutdown
!

IS-IS and Segment Routing

net and prefix-sid are based on the loopback address. router-id is not strictly necessary for getting IS-IS to run but is needed later on. I set a default metric of 10 on all links just to make it easier for myself.

router isis LAB
 is-type level-2-only
 net 49.000.0100.1303.700x.00
 address-family ipv4 unicast
  metric-style wide
  metric 10
  router-id Loopback0
  segment-routing mpls
 !
 interface Loopback0
  passive
  address-family ipv4 unicast
   prefix-sid index x
  !
 !
 interface GigabitEthernet0/0/0/0
  point-to-point
  address-family ipv4 unicast
  !
 !
 interface GigabitEthernet0/0/0/1
  point-to-point
  address-family ipv4 unicast
  !
 !
!
segment-routing
 global-block 16000 23999
!

The segment routing global-block is the default Cisco SRGB (16000-23999).
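How the NET and the SR label follow from the loopback address can be sketched in a few lines of Python. This is only an illustration: the helper names are made up, and using the last octet of the loopback as the prefix-sid index is an assumption about what x stands for in the configuration. The resulting system ID for pe1 matches the Root node value shown in the ORR database output further down.

```python
import ipaddress

SRGB_START, SRGB_END = 16000, 23999  # global-block from the configuration above


def system_id(loopback: str) -> str:
    """Zero-pad each octet to three digits and regroup them as xxxx.xxxx.xxxx."""
    octets = ipaddress.IPv4Address(loopback).packed
    digits = "".join(f"{octet:03d}" for octet in octets)
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


def prefix_sid_label(index: int) -> int:
    """An index SID maps to the MPLS label SRGB start + index."""
    label = SRGB_START + index
    if not SRGB_START <= label <= SRGB_END:
        raise ValueError(f"index {index} does not fit in the SRGB")
    return label


print(system_id("10.13.37.5"))   # 0100.1303.7005
print(prefix_sid_label(5))       # 16005 (assuming index 5 for pe1)
```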

BGP (core)

A static route is created and advertised for the default route.

router static
 address-family ipv4 unicast
  0.0.0.0/0 Null0
 !
!
router bgp 65534
 bgp router-id 10.13.37.x
 default-information originate
 address-family ipv4 unicast
  redistribute static
 !
 session-group iBGP
  remote-as 65534
  timers 10 32
  update-source Loopback0
 !
 neighbor 10.13.37.255
  use session-group iBGP
  description Route Reflector
  address-family ipv4 unicast
  !
 !
!

BGP (PE)

router bgp 65534
 bgp router-id 10.13.37.x
 address-family ipv4 unicast
 !
 session-group iBGP
  remote-as 65534
  timers 10 32
  update-source Loopback0
 !
 neighbor 10.13.37.255
  use session-group iBGP
  description Route Reflector
  address-family ipv4 unicast
  !
 !
!

BGP (RR)

router bgp 65534
 bgp router-id 10.13.37.255
 address-family ipv4 unicast
 !
 session-group iBGP
  remote-as 65534
  timers 10 32
  update-source Loopback0
 !
 neighbor-group RR-CLIENTS
  use session-group iBGP
  address-family ipv4 unicast
   route-reflector-client
  !
 !
 neighbor 10.13.37.1
  use neighbor-group RR-CLIENTS
  description cr1
 !
 neighbor 10.13.37.2
  use neighbor-group RR-CLIENTS
  description cr2
 !
 neighbor 10.13.37.5
  use neighbor-group RR-CLIENTS
  description pe1
 !
 neighbor 10.13.37.6
  use neighbor-group RR-CLIENTS
  description pe2
 !
 neighbor 10.13.37.7
  use neighbor-group RR-CLIENTS
  description pe3
 !
!

The full, finished configuration and CML lab setup file are on GitHub.

Some background and theory

What is BGP-ORR, short version

BGP-ORR is a BGP feature which allows a BGP Route Reflector to send a Route Reflector Client the best path based on the perspective of the client, and not the Route Reflector itself. This is accomplished by having the Route Reflector know about the IGP topology, and based on this data do SPF (Shortest Path First) calculations from IGP locations other than the Route Reflector.
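To make "SPF from another location" concrete, here is a minimal Python sketch of Dijkstra's algorithm run from different roots. It is not how any vendor implements ORR. The link list is an assumption built from the lab description (ring order cr1, cr2, p2, p1 and a metric of 10 on every link), and the names LINKS, build_graph and spf are made up for the example.

```python
import heapq

# Assumed lab links, all with IGP metric 10.
LINKS = [
    ("cr1", "cr2", 10), ("cr2", "p2", 10), ("p2", "p1", 10), ("p1", "cr1", 10),
    ("pe1", "cr1", 10), ("pe2", "cr2", 10), ("pe3", "pe2", 10), ("rr", "p2", 10),
]


def build_graph(links):
    """Turn the link list into an adjacency map; links are bidirectional."""
    graph = {}
    for a, b, metric in links:
        graph.setdefault(a, {})[b] = metric
        graph.setdefault(b, {})[a] = metric
    return graph


def spf(graph, root):
    """Dijkstra: lowest total metric from root to every other node."""
    cost = {root: 0}
    queue = [(0, root)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > cost[node]:
            continue  # stale queue entry, a cheaper path was already found
        for neighbor, metric in graph[node].items():
            candidate = dist + metric
            if candidate < cost.get(neighbor, float("inf")):
                cost[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return cost


GRAPH = build_graph(LINKS)
for root in ("rr", "pe1", "pe2"):
    costs = spf(GRAPH, root)
    print(f"{root}: cr1={costs['cr1']} cr2={costs['cr2']}")
```

Seen from rr, cr2 is closer than cr1, while from pe1 it is the other way around. That difference in perspective is all BGP-ORR needs in order to pick a different best path per client.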

What problem does it solve?

(Feel free to skip this if you’re already familiar with Route Reflectors and the inherent problems that come with them.)

Let’s take a look at our small lab network. It wouldn’t be a problem to configure each PE router to have iBGP sessions with the core routers and have the core routers act as route reflectors, reflecting the prefixes learnt from each PE while also advertising default routes (to make sure each PE can reach the rest of the network and the Internet).

This however gets very tedious, very fast. In a larger network with many PE routers there are a lot of sessions to configure and a lot of churn on the core routers.

There’s also the money part to think about. Depending on how the network is designed the core routers might have to be able to consume a full BGP table (IPv4 and IPv6) and as I write this we’re at around 950k IPv4 prefixes and 170k IPv6 prefixes. Those are non-trivial numbers and require some beefy hardware.[1]

From a network design perspective it would also be a lot easier (and cheaper) to replace the beefy core routers with pure LSR devices which just forward packets from one port to another.

That’s where a dedicated Route Reflector comes into play.

The Route Reflector (or just RR) is a dedicated router (either physical or virtual) whose job is to have iBGP sessions with each and every other router in the network and then reflect routes to other routers (according to configured policies). “Reflect routes” in this context means that the RR will send routes learnt from one iBGP neighbor to other iBGP neighbors - something an iBGP speaker normally doesn’t do, and which requires the route-reflector-client configuration on the neighbor.
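The reflection rules can be written down as a small decision function. The sketch below follows the rules in RFC 4456 in simplified form: it ignores routing policy, loop prevention (ORIGINATOR_ID and CLUSTER_LIST) and the fact that a route is never sent back to the neighbor it came from. The names PeerType and may_advertise are made up for the example.

```python
from enum import Enum


class PeerType(Enum):
    EBGP = "eBGP peer"
    CLIENT = "iBGP route-reflector-client"
    NON_CLIENT = "iBGP non-client"


def may_advertise(learned_from: PeerType, send_to: PeerType,
                  is_route_reflector: bool) -> bool:
    """Return True if a route learnt from one peer type may be sent to another."""
    # eBGP is never the problem: eBGP-learnt routes go to everyone,
    # and everything may be sent to eBGP peers.
    if learned_from is PeerType.EBGP or send_to is PeerType.EBGP:
        return True
    # A plain iBGP speaker never passes iBGP-learnt routes to other iBGP peers.
    if not is_route_reflector:
        return False
    # A route reflector reflects client routes to all iBGP peers,
    # and non-client routes to clients only.
    return learned_from is PeerType.CLIENT or send_to is PeerType.CLIENT


# cr1 (client) -> rr -> pe1 (client): reflected only because rr is a route reflector.
print(may_advertise(PeerType.CLIENT, PeerType.CLIENT, is_route_reflector=True))
print(may_advertise(PeerType.NON_CLIENT, PeerType.NON_CLIENT, is_route_reflector=False))
```

In the lab every neighbor of rr is a client, so every route rr learns may be reflected to every other neighbor.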

That’s it. Besides the extra configuration necessary for the neighbors there’s nothing special about a route reflector. And that’s the problem.

Let’s look at the lab network again.

Both cr1 and cr2 send a default-route to the RR. This is what the BGP table in the RR looks like:

RP/0/0/CPU0:rr#show bgp ipv4 unicast
Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network            Next Hop            Metric LocPrf Weight Path
* i0.0.0.0/0          10.13.37.1               0    100      0 ?
*>i                   10.13.37.2               0    100      0 ?

Processed 1 prefixes, 2 paths

It picks one of the default routes as the best and this is the route that will be advertised to all neighbors. This is the reason:

RP/0/0/CPU0:rr#show bgp ipv4 unicast 0.0.0.0/0 bestpath-compare
BGP routing table entry for 0.0.0.0/0
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                  6           6
Last Modified: Jan 17 11:52:41.481 for 00:01:26
Paths: (2 available, best #2)
  Advertised to update-groups (with more than one peer):
    0.2
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local, (Received from a RR-client)
    10.13.37.1 (metric 30) from 10.13.37.1 (10.13.37.1)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Received Path ID 0, Local Path ID 0, version 0
      Higher IGP metric than best path (path #2)
  Path #2: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.2
  Local, (Received from a RR-client)
    10.13.37.2 (metric 20) from 10.13.37.2 (10.13.37.2)
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 0, version 6
      best of local AS, Overall best

The route from 10.13.37.1 (cr1) has a higher IGP metric than the route from cr2, which is expected.

And also as expected, cr2 is advertised as the best route:

RP/0/0/CPU0:rr#show bgp ipv4 unicast neighbors 10.13.37.6 advertised-routes
Network            Next Hop        From            AS Path
0.0.0.0/0          10.13.37.2      10.13.37.2      ?

Processed 1 prefixes, 1 paths

In our small network with uniform IGP metrics on all links this isn’t an issue, but this is rarely how the real world looks. We might want pe1 to push its traffic towards cr1 by default, and pe2/pe3 towards cr2 (due to, for instance, real-world distance). Then we would want the route reflector to send the default route which has the lowest IGP metric from the perspective of each PE router.

As our setup is right now this isn’t possible, since the route reflector will send the route which is best from the perspective of the route reflector. This is the problem BGP-ORR solves, since the routes sent are based on the IGP metric seen from each PE.

What do I want to find out?

Let’s get technical

First let’s make a small topology change. The IGP metric is increased to 100 between cr1 and pe2, and cr2 and pe1.

The basics

Before anything we must allow topology data from our IGP (IS-IS) to be re-distributed into BGP. This is done using this command on the route reflector:

router isis LAB
 distribute link-state
!

OSPF has the same command, but there’s a caveat you should be aware of. If running IS-IS you must also specify the router-id in the configuration, otherwise the router won’t calculate the topology (ask me how I know…).

The next step is to configure an ORR group. It’s possible to have 32 groups (at least on IOS-XR and Arista EOS) and each group can have up to three IP addresses (also called root nodes) specified, which the SPF calculations will be based on. I’m not sure when you would want to specify more than one, perhaps in a scenario where the primary router goes away and the route reflector needs to re-calculate based on a secondary router?

Anyway, we’ll start easy by creating one group per PE router:

router bgp 65534
 address-family ipv4 unicast
  optimal-route-reflection pe1 10.13.37.5
  optimal-route-reflection pe2 10.13.37.6
  optimal-route-reflection pe3 10.13.37.7
 !
!

Now we can take a look at the ORR database:

RP/0/0/CPU0:rr#show orrspf database pe1
ORR policy: pe1, IPv4, RIB tableid: 0xe0000010
Configured root: primary: 10.13.37.5, secondary: NULL, tertiary: NULL
Actual Root: 10.13.37.5, Root node: 0100.1303.7005.0000

Prefix                                        Cost
10.13.37.1/32                                 10
10.13.37.2/32                                 20
10.13.37.3/32                                 20
10.13.37.4/32                                 30
10.13.37.5/32                                 0
10.13.37.6/32                                 20
10.13.37.7/32                                 30
10.13.37.255/32                               40

Number of mapping entries: 9

RP/0/0/CPU0:rr#show orrspf database pe2
ORR policy: pe2, IPv4, RIB tableid: 0xe0000011
Configured root: primary: 10.13.37.6, secondary: NULL, tertiary: NULL
Actual Root: 10.13.37.6, Root node: 0100.1303.7006.0000

Prefix                                        Cost
10.13.37.1/32                                 20
10.13.37.2/32                                 10
10.13.37.3/32                                 30
10.13.37.4/32                                 20
10.13.37.5/32                                 30
10.13.37.6/32                                 0
10.13.37.7/32                                 10
10.13.37.255/32                               30

Number of mapping entries: 9

Looks reasonable. Let’s change the metric between cr1 and pe1 to 1000 and see how the database changes:

RP/0/0/CPU0:rr#show orrspf database pe1
ORR policy: pe1, IPv4, RIB tableid: 0xe0000010
Configured root: primary: 10.13.37.5, secondary: NULL, tertiary: NULL
Actual Root: 10.13.37.5, Root node: 0100.1303.7005.0000

Prefix                                        Cost
10.13.37.1/32                                 110
10.13.37.2/32                                 100
10.13.37.3/32                                 120
10.13.37.4/32                                 110
10.13.37.5/32                                 0
10.13.37.6/32                                 110
10.13.37.7/32                                 120
10.13.37.255/32                               120

Number of mapping entries: 9

(If you’re playing along at home, be prepared for changes in the network to take some time to propagate fully into the ORR database: at least 15-20 seconds.)

Since the IGP metric between pe1 and cr2 is 100 this is fully expected: every path from pe1 now goes via cr2, so cr2 costs 100 and every other prefix costs 100 plus its distance from cr2. Let’s change the cr1-pe1 cost back to 10 again.

Now we have verified that the route reflector is “cost aware”, and thus can do SPF calculations based on each client (PE router). To actually enable this we need to activate it on each neighbor (or neighbor-group):

router bgp 65534
 neighbor 10.13.37.5
  address-family ipv4 unicast
   optimal-route-reflection pe1
  !
 !
 neighbor 10.13.37.6
  address-family ipv4 unicast
   optimal-route-reflection pe2
  !
 !
 neighbor 10.13.37.7
  address-family ipv4 unicast
   optimal-route-reflection pe3
  !
 !
!
end

Let’s look at the advertised routes:

RP/0/0/CPU0:rr#show bgp ipv4 unicast neighbors 10.13.37.5 advertised-routes
Network            Next Hop        From            AS Path
0.0.0.0/0          10.13.37.1      10.13.37.1      ?

RP/0/0/CPU0:rr#show bgp ipv4 unicast neighbors 10.13.37.6 advertised-routes
Network            Next Hop        From            AS Path
0.0.0.0/0          10.13.37.2      10.13.37.2      ?

RP/0/0/CPU0:rr#show bgp ipv4 unicast neighbors 10.13.37.7 advertised-routes
Network            Next Hop        From            AS Path
0.0.0.0/0          10.13.37.2      10.13.37.2      ?

If we go back and look at the output from show orrspf database this is correct. pe1 (10.13.37.5) has a lower metric towards cr1 (10.13.37.1) and pe2/pe3 (10.13.37.6/10.13.37.7) have a lower metric towards cr2 (10.13.37.2). Let’s increase the metric between cr1 and pe1 again and see what changes:

RP/0/0/CPU0:rr#show bgp ipv4 unicast neighbors 10.13.37.5 advertised-routes
Network            Next Hop        From            AS Path
0.0.0.0/0          10.13.37.2      10.13.37.2      ?

Now the route reflector is advertising the default route from cr2, since the cost to cr2 is lower.

The problem we had before is now gone since our route reflector is advertising not what it sees as best but rather what’s best for the client!
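The decision the route reflector makes here can be replayed with the numbers from the outputs above. This is a rough Python sketch, not the real best-path algorithm: it assumes both default routes tie on every earlier step (local preference, AS path, origin, MED), which they do in this lab, so the IGP cost to the next hop decides. The dictionary and function names are made up for the example.

```python
# IGP cost to the two default-route next hops, copied from the outputs above.
RR_VIEW = {"10.13.37.1": 30, "10.13.37.2": 20}        # bestpath-compare on rr
ORR_VIEW = {
    "pe1": {"10.13.37.1": 10, "10.13.37.2": 20},      # show orrspf database pe1
    "pe2": {"10.13.37.1": 20, "10.13.37.2": 10},      # show orrspf database pe2
}
PE1_AFTER_CHANGE = {"10.13.37.1": 110, "10.13.37.2": 100}  # cr1-pe1 metric 1000


def best_next_hop(costs):
    """Pick the next hop with the lowest IGP cost."""
    return min(costs, key=costs.get)


# Without ORR every client gets whatever is best from the reflector's position.
print("plain RR ->", best_next_hop(RR_VIEW))

# With ORR the same comparison is done with the costs of the client's ORR group.
for client, costs in ORR_VIEW.items():
    print(client, "->", best_next_hop(costs))
print("pe1 after metric change ->", best_next_hop(PE1_AFTER_CHANGE))
```

The results line up with the advertised-routes outputs: cr1 for pe1, cr2 for pe2, and cr2 for pe1 once the cr1-pe1 metric has been raised.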

Let’s keep trying things out.

Do we need a unique ORR group for each router in our network?

No.

It’s perfectly fine to use the pe2 ORR group for the pe3 router, since it’s single-homed towards pe2 and probably should get the same routes.

router bgp 65534
 neighbor 10.13.37.7
  address-family ipv4 unicast
   no optimal-route-reflection pe3
   optimal-route-reflection pe2
  !
 !
!
end

Do the CR devices need to speak BGP?

No.

Let’s shut down the BGP session on cr1 so cr2 will be the only router advertising a default route:

router bgp 65534
 neighbor 10.13.37.255
  shutdown
 !
!

The route reflector will now advertise the default route from cr2 towards all PE devices, but the path towards the next hop still follows the IGP metric:

RP/0/0/CPU0:pe1#sh route | i 10.13.37.2
Gateway of last resort is 10.13.37.2 to network 0.0.0.0
B*    0.0.0.0/0 [200/0] via 10.13.37.2, 00:02:59
i L2 10.13.37.2/32 [115/20] via 10.13.37.1, 00:11:00, GigabitEthernet0/0/0/0

Just as expected.

Scaling up

I briefly mentioned a limit of 32 ORR groups. This specific number isn’t specified in the RFC but is a limit in both IOS-XR and EOS. Practically speaking this means that if we have more than 32 routers we need to think about our setup.

Having more virtual route reflectors is of course an option, where each reflector could serve a specific part of the network, but I want to find out what happens if I configure the core routers (cr1 and cr2) as root nodes in my ORR group and then use that group for all of my PE routers connected to those core routers - i.e. have one ORR group per region in my network.

Let’s create a new ORR group:

router bgp 65534
 address-family ipv4 unicast
  optimal-route-reflection core 10.13.37.1 10.13.37.2
 !
 neighbor 10.13.37.5
  address-family ipv4 unicast
   no optimal-route-reflection pe1
   optimal-route-reflection core
  !
 !
 neighbor 10.13.37.6
  address-family ipv4 unicast
   no optimal-route-reflection pe2
   optimal-route-reflection core
  !
 !
 neighbor 10.13.37.7
  address-family ipv4 unicast
   no optimal-route-reflection pe2
   optimal-route-reflection core
  !
 !
!

ORRSPF database output:

RP/0/0/CPU0:rr#show orrspf database detail
ORR policy: core, IPv4, RIB tableid: 0xe0000013
Configured root: primary: 10.13.37.1, secondary: 10.13.37.2, tertiary: NULL
Actual Root: 10.13.37.1, Root node: 0100.1303.7001.0000

Prefix                                        Cost
10.13.37.1/32                                 0
10.13.37.2/32                                 10
10.13.37.3/32                                 10
10.13.37.4/32                                 20
10.13.37.5/32                                 10
10.13.37.6/32                                 10
10.13.37.7/32                                 20
10.13.37.255/32                               30

Number of mapping entries: 9

Looking at the output it’s probably not that hard to figure out what the route reflector will advertise (although it did take me a while to figure out what was happening).

The ORR group has the primary root node 10.13.37.1, which is also advertising a default route. So the route reflector will advertise the route from 10.13.37.1, because the cost is 0. Can’t get lower than that![2] So no matter the IGP metric the PE routers will always receive a default route from 10.13.37.1 (cr1).

This may or may not be a problem but I wouldn’t want to deploy this in a production network. I just feel that somewhere down the line there’s sub-optimal routing or even a loop just waiting to happen :)

Before I continue I would like to point towards the presentation Modern BGP Design and how the author brings up the possibility of using ORR and add-path together.


Really quick about BGP add-path. By default a BGP speaker only advertises what it thinks is the best path - as in a single path. Add-path is an extra capability which enables a BGP speaker to send and/or receive more paths, or additional paths. It’s then up to the router to decide what to do with the extra path(s): discard them, install them as backup paths or something else.

As it is a capability there’s no guarantee that all devices support it and in that case only the best path will be advertised and we’re back to square one. Ask me how I know, again…


To enable add-path we need to configure it on both the Route Reflector and the PE devices:

PE:
router bgp 65534
 bgp router-id 10.13.37.5
 address-family ipv4 unicast
  additional-paths receive
 !
!

RR:
route-policy RR-CLIENTS
  set path-selection backup 1 advertise
end-policy
!
router bgp 65534
 address-family ipv4 unicast
  optimal-route-reflection core 10.13.37.1 10.13.37.2
  additional-paths send
  additional-paths selection route-policy RR-CLIENTS
 !
!

Let’s re-establish the session to make sure the capability has been negotiated:

RP/0/0/CPU0:pe1#clear bgp 10.13.37.255
RP/0/0/CPU0:pe1#show bgp ipv4 unicast neighbors 10.13.37.255 | i Additional
    Additional-paths Send: received
    Additional-paths Receive: advertised
  Additional-paths operation: Receive

If everything is working as intended we should see the route reflector advertising two default routes, and the PE should accept them both and install the best one into the FIB:

RP/0/0/CPU0:rr#show bgp ipv4 unicast neighbors 10.13.37.5 advertised-routes
Network            Next Hop        From            AS Path
0.0.0.0/0          10.13.37.2      10.13.37.2      ?
                   10.13.37.1      10.13.37.1      ?


RP/0/0/CPU0:pe1#show bgp ipv4 unicast
Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network            Next Hop            Metric LocPrf Weight Path
* i0.0.0.0/0          10.13.37.2               0    100      0 ?
*>i                   10.13.37.1               0    100      0 ?

This is looking promising. Very promising indeed. Let’s once again increase the metric between cr1 and pe1 to 1000 and see if anything changes. Fingers crossed.

RP/0/0/CPU0:pe1#show bgp ipv4 unicast
Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network            Next Hop            Metric LocPrf Weight Path
*>i0.0.0.0/0          10.13.37.2               0    100      0 ?
* i                   10.13.37.1               0    100      0 ?

Look at that, the default route has switched over to the other router - great!
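Why this works even though the core group is rooted at cr1 can be sketched in Python. The model is deliberately simplified: the reflector ranks paths by the cost from the ORR root and sends the best one plus one backup, and the PE then ranks what it received by its own IGP cost. The pe1 costs are taken from the earlier pe1 ORR database outputs, on the assumption that they equal what pe1 itself sees. All names are made up for the example.

```python
def rr_advertise(next_hops, root_costs, backups=1):
    """Best path plus `backups` extra paths, ranked by cost from the ORR root."""
    ranked = sorted(next_hops, key=lambda nh: root_costs[nh])
    return ranked[:1 + backups]


def pe_best(received, own_costs):
    """The PE picks among the received paths using its own IGP cost."""
    return min(received, key=lambda nh: own_costs[nh])


CORE_ROOT = {"10.13.37.1": 0, "10.13.37.2": 10}       # ORR group core, root cr1
PE1_BEFORE = {"10.13.37.1": 10, "10.13.37.2": 20}     # cr1-pe1 metric 10
PE1_AFTER = {"10.13.37.1": 110, "10.13.37.2": 100}    # cr1-pe1 metric 1000

received = rr_advertise(["10.13.37.1", "10.13.37.2"], CORE_ROOT)
print(received)                        # both default routes reach pe1
print(pe_best(received, PE1_BEFORE))   # 10.13.37.1
print(pe_best(received, PE1_AFTER))    # 10.13.37.2
```

With both paths delivered, the reflector's own ranking no longer matters for this prefix, because the final choice is made on the PE.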

Summary

That’s it. I’ve tried everything I wanted to try regarding BGP-ORR and I’m quite happy with the results. The limit of 32 groups can be a bit limiting in a large network but if using add-path is an option it seems to be a good way forward.

I’m still excited about BGP-ORR and I do look forward to trying it out more in a real production network. Combined with add-path I think it’s going to work out just great.

If you’ve read this far, thank you! If I’ve made any mistakes, misunderstood something or you just want to say hi, the easiest way is Mastodon.


  1. It’s all relative; COTS hardware running BIRD can probably handle many full tables without problems.

  2. You know what? It wouldn’t surprise me if we somehow could.