[cs615asa] New York SRE Tech Talks Report
szhong2
szhong2 at stevens.edu
Mon May 1 15:42:20 EDT 2017
Hi everyone,
I attended the New York SRE Tech Talk last week, and here are my summary,
thoughts and comments on the event. I chose this tech talk because I have
always been interested in the field of SRE. I think this field is not only
about site reliability; its methodologies and architectures can be used
in many different kinds of software engineering. Plus, it’s awesome to
have a chance to take a little tour of Facebook’s New York office.
Back to the tech talk. The first topic of the night was "Linux Network
Switches", which was the part I was interested in the most. This topic
was mainly about the Clos topology, which was used by Shapeways to
manage their two data centers.
Compared to traditional "tree-like" topologies, the Clos topology has
some advantages. In a tree-like topology, each physical host on a rack
is connected to a single switch, which is likely mounted on the same
rack. The switch on each rack is connected to a parent switch in the
data center or server farm, so that different racks can communicate
with each other. If there is more than one data center, there will be a
root router to route traffic between them. So in this type of topology
there are at least three layers: root router, parent switch and hosts.
If a host wants to exchange data with a host in a different data center,
there will be at least three hops. With the Clos topology, things are
simpler. There are only two layers in the network: spine and leaf. Leaf
means the physical hosts in the network, and spine is the root router.
It is possible to have more than one spine. In the abstract, each leaf
host is directly connected to all the spines, and the spines handle all
the routing between hosts. So in the Clos topology, only one hop is
needed when a leaf host wants to talk to any other leaf host.
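To make the hop comparison concrete, here is a minimal sketch in Python.
It is only an illustration: the node names are made up, the tree is
reduced to the three layers counted above (hosts, one parent switch per
data center, root router), and a "hop" is assumed to mean a device
sitting between the two endpoints on a shortest path.

```python
from collections import deque

def intermediate_hops(links, src, dst):
    """Count the devices strictly between src and dst on a shortest path."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:  # plain breadth-first search
        node = queue.popleft()
        if node == dst:
            return max(dist[node] - 1, 0)
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None  # dst is not reachable from src

# Hypothetical tree: two hosts in different data centers.
tree_links = [
    ("host-a", "parent-dc1"), ("parent-dc1", "root"),
    ("root", "parent-dc2"), ("parent-dc2", "host-b"),
]

# Hypothetical leaf-spine fabric: every leaf is wired to every spine.
leaves = ["leaf-a", "leaf-b", "leaf-c"]
spines = ["spine-1", "spine-2"]
clos_links = [(leaf, spine) for leaf in leaves for spine in spines]

tree_hops = intermediate_hops(tree_links, "host-a", "host-b")
clos_hops = intermediate_hops(clos_links, "leaf-a", "leaf-b")
print("tree:", tree_hops, "clos:", clos_hops)
```

Under these assumptions the tree path crosses three devices and the
leaf-spine path crosses one, which matches the counts above.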
Also, the simplicity of the Clos topology brings stability and
scalability. Because there can be more than one spine and all leaf hosts
are connected to all the spines, redundancy can easily be achieved by
duplicating the spine layer. If one spine is down, the other spine keeps
working and the whole network still functions normally. In tree-like
topologies, because there are multiple layers and each layer has its own
scope and function, achieving redundancy costs much more.
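The redundancy argument can be checked with a small sketch as well.
Again, this is only an illustration with made-up node names: it builds
a fabric where every leaf is connected to every spine, marks some
spines as failed, and tests whether two leaves can still reach each
other.

```python
def reachable(links, src, dst, failed=()):
    """Return True if dst can be reached from src while avoiding failed nodes."""
    graph = {}
    for a, b in links:
        if a in failed or b in failed:
            continue  # links touching a failed device are unusable
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen, stack = {src}, [src]
    while stack:  # depth-first search over the surviving links
        node = stack.pop()
        if node == dst:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# Hypothetical fabric: three leaves, two spines, full leaf-to-spine wiring.
leaves = ["leaf-a", "leaf-b", "leaf-c"]
spines = ["spine-1", "spine-2"]
clos_links = [(leaf, spine) for leaf in leaves for spine in spines]

one_spine_down = reachable(clos_links, "leaf-a", "leaf-c", failed={"spine-1"})
all_spines_down = reachable(clos_links, "leaf-a", "leaf-c",
                            failed={"spine-1", "spine-2"})
print(one_spine_down, all_spines_down)
```

With one spine failed the leaves still reach each other through the
remaining spine; only losing every spine cuts them off.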
The speaker from Shapeways also mentioned that in their Clos network,
they used BGP advertisements to update routing information. When a new
leaf host comes up or a new spine is added, the new routing information
is advertised to all the hosts in the network. This reduces the two-way
queries hosts would otherwise need to locate another host.
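As a rough illustration of the advertising idea, here is a toy sketch
in Python. It is not real BGP and not what Shapeways runs: the Node
class, the node names and the prefix 10.0.3.0/24 are all hypothetical,
and the only behavior modeled is that a node announces a prefix to its
peers, each peer remembers the shortest path it has heard and passes
the announcement on, and a node ignores any announcement whose path
already contains itself.

```python
class Node:
    """Toy path-vector speaker (an illustration, not a BGP implementation)."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.routes = {}  # prefix -> list of nodes to traverse, next hop first

    def connect(self, other):
        self.peers.append(other)
        other.peers.append(self)

    def announce(self, prefix):
        self.routes[prefix] = []  # the prefix is local to this node
        for peer in self.peers:
            peer.receive(prefix, [self.name])

    def receive(self, prefix, path):
        if self.name in path:
            return  # loop prevention: we are already on this path
        known = self.routes.get(prefix)
        if known is not None and len(known) <= len(path):
            return  # keep the route we have; it is at least as short
        self.routes[prefix] = path
        for peer in self.peers:
            peer.receive(prefix, [self.name] + path)

# Hypothetical fabric: two spines and two existing leaves.
spines = [Node("spine-1"), Node("spine-2")]
leaf_a, leaf_b = Node("leaf-a"), Node("leaf-b")
for leaf in (leaf_a, leaf_b):
    for spine in spines:
        leaf.connect(spine)

# A new leaf comes up, connects to every spine and announces its prefix.
leaf_c = Node("leaf-c")
for spine in spines:
    leaf_c.connect(spine)
leaf_c.announce("10.0.3.0/24")

route_from_a = leaf_a.routes["10.0.3.0/24"]
print(route_from_a)
```

After the announcement, every other node already holds a path to the
new prefix, so a lookup is a local table read instead of a query to
another host.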
In conclusion, the Clos topology reduces the cost of hardware layers and
is more efficient, scalable and stable than some traditional network
topologies. I think the Clos topology is useful not only as a network
topology; the architecture can also be ported to other fields. For
example, the two-layer architecture can easily be ported to a
distributed system, in which the spines represent the master nodes and
the leaves represent the slave nodes. And the advertising mechanism can
be used to maintain things like the “finger table” in a distributed
system.
The other two topics of the night were “Automating The Linux Kernel
Validation” and “Dependency Traps and Gotchas”. The first one was about
testing and CI/CD for the Linux kernel at Facebook. The latter was about
traps, like deadlocks, in a system consisting of different
microservices. These two topics were not as interesting to me as the
first one, so I am not going to elaborate on them in my report. I am
sure fellow classmates who attended the same event will have more
detailed descriptions of these two topics.
That's all, thanks for reading.
Shaoliang Zhong