[cs615asa] Meet Up Report of New York SRE Tech Talks

Jinglong Wu jwu29 at stevens.edu
Tue May 2 23:36:06 EDT 2017


Hello class,

At 7pm on April 26, 2017, I attended a Google/Facebook meetup on Site
Reliability at Facebook's NYC office. Three talks were given:

    G.I.F.E.E. data-center network
    Automating Linux Kernel Validation
    Dependency Traps and Gotchas

Meetup link: https://developers.google.com/events/sre/nyc#april_2017



Why I Chose It

    The first reason is that I thought this meetup, with its topic of Site
Reliability, would help me better understand how to maintain a site. I am
currently working at a start-up, where I get chances to act as a System
Admin and to touch network (VPC) configuration and application/server
deployments. Therefore, I am interested in topics about data centers,
networking, and system design (dependency management).

    Secondly, site reliability is a primary job of a system admin, so it
is highly related to our class.



Brief Summary


    Lecture 1 : G.I.F.E.E. data-center network


    G.I.F.E.E. stands for Google Infrastructure for Everyone Else. After
some research, I learned that GIFEE means everyone (small companies and
startups) can use Google-style infrastructure technology, like GFS and
BigTable, and Google's approach of managing infrastructure on top of a
public cloud with elastic resources.

    Why mention GIFEE? Because we are in a GIFEE generation now: when
everything depends on the public cloud, the network, specifically the
data-center network, becomes important. The speaker is from ShapeWays, a
company that runs its service on physical data centers.

    The speaker introduced the "old" network topology, in which each data
center (DC) acts like the root of a tree. If two servers belonging to
different DCs want to talk to each other, the message has to go all the
way up to the local root, cross the link between the two roots, and then
go all the way down to the other server. In this case the connection
depends heavily on the availability of the DC roots, and the traffic
between the two DCs becomes large.

    To overcome this shortcoming, he introduced a "new" approach, the
Clos topology, whose structure is leaf-and-spine: every spine device
connects to every leaf, so each server is reachable over multiple paths.
In such a topology, connections between servers do not depend on a single
aggregation point, and the network stays available even if some of the
spines go down.
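
    To make the contrast concrete, here is a minimal sketch (in Python,
with made-up node names; this is my own illustration, not from the talk)
that checks whether two racks can still reach each other after a single
aggregation device fails in each topology:

    # Tree: each rack hangs off exactly one aggregation switch under one root.
    # Leaf/spine (Clos): every leaf connects to every spine.
    def reachable(links, src, dst, down=frozenset()):
        """Simple graph search over undirected links, skipping failed nodes."""
        graph = {}
        for a, b in links:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
        seen, queue = {src}, [src]
        while queue:
            node = queue.pop()
            for nxt in graph.get(node, ()):
                if nxt not in seen and nxt not in down:
                    seen.add(nxt)
                    queue.append(nxt)
        return dst in seen

    tree = [("rack1", "agg1"), ("rack2", "agg1"), ("agg1", "root"),
            ("rack3", "agg2"), ("rack4", "agg2"), ("agg2", "root")]
    clos = [(leaf, spine) for leaf in ("rack1", "rack2", "rack3", "rack4")
                          for spine in ("spine1", "spine2")]

    print(reachable(tree, "rack1", "rack3", down={"root"}))    # False
    print(reachable(clos, "rack1", "rack3", down={"spine1"}))  # True

    In the tree, every cross-rack path goes through the root, so losing it
(or saturating its links) hurts everyone; in the leaf/spine layout any one
spine can fail and every pair of racks still has a path.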

    ShapeWays also improves site reliability within the Clos topology. In
the "old" approach, most servers on a rack are on layer 2, while only the
device connecting to the DC is on layer 3. It is hard to assign static IPs
to all servers that way, so they go "layer 3 everywhere": every server
advertises its own /32 address (for example 10.0.0.1/32). In my
understanding (sorry, I did not fully understand this part), each server
announces its address within the rack, so every server in the rack can be
reached directly from the DC.

    The advantage of this approach is that all the tools we are familiar
with, such as traceroute, ping, vtysh, and mtr, work from the DC all the
way to every server. Another advantage is that it is faster, thanks to its
routing structure. The speaker mentioned that FRRouting is better than
Quagga; sadly, I did not understand this part.

    Last, the speaker introduced the hardware ShapeWays is using. They use
Penguin network switches, with 40G links in the spine (DC to server) and
10G links at the leaf (server to server). He also mentioned many tools for
managing such a network, like Puppet.


    Lecture 2 : Automating Linux kernel validation

    The speaker is from Facebook and works on the Linux kernel.

    # Why is kernel testing hard?

        1. Kernel versions change fast.
        2. The kernel is built against hardware rather than against some
system/platform/environment, so we need real workloads to see whether it
works fine.

    # What to test?

        1. Performance test: regression
        2. Stability
        3. Security

    # What is the community doing?

        1. Code reviews

        Having experienced developers review the code always helps a lot,
since an author tends to be trapped in his own way of thinking.

        2. Maintainer testing

        Let the kernel be tested by its maintainers; the maintainer is the
person most familiar with that part of the kernel.

        3. Test packages: use tools!
        Tools:
            1. Linux Test Project (the basic one)
            2. Xfstests
            3. Fio, the flexible I/O tester
            4. Kernel selftests
            5. Perf tests

        4. Kernel fuzzing (send garbage as a test)

        Fuzzing: sending meaningless, arbitrary, and random input into the
kernel to see what happens (a toy sketch of the idea follows after this
list).
        Tools:
            1. Trinity
            2. Syzkaller
            3. Perf_fuzzer

        5. Static analysis and debugging tools
        Tools:
            1. Coccinelle
            2. GCC warnings
            3. KASAN
            4. Other debug compile options
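
    To illustrate the fuzzing idea, here is a toy sketch (Python, with a
made-up target function; nothing here is from the talk) that just throws
random bytes at a parser and reports the first crash. Real kernel fuzzers
such as Trinity and syzkaller generate structured syscalls instead of
plain garbage, but the principle is the same:

    import os
    import random

    def parse_record(data: bytes) -> int:
        """Stand-in for the code under test (hypothetical)."""
        length = data[0]                 # blows up (IndexError) on empty input
        return int.from_bytes(data[1:1 + length], "big")

    random.seed(615)
    for _ in range(10_000):
        blob = os.urandom(random.randint(0, 16))
        try:
            parse_record(blob)
        except Exception as exc:         # any crash here is a bug worth reporting
            print(f"input {blob!r} triggered {type(exc).__name__}: {exc}")
            break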

    # Extending tests (Good idea)

        1. Network testing
        2. Replicate workloads
        3. Performance regressions
        4. Specific bugs
        5. Adding coverage
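
    As a rough idea of what "automating validation" could look like, here
is a minimal harness sketch (Python; the test commands are placeholders I
made up, not what Facebook actually runs) that executes a list of test
commands against the currently booted kernel and summarizes pass/fail:

    import platform
    import subprocess

    # Placeholder commands; a real harness would invoke LTP, xfstests, fio,
    # kernel selftests, etc., on a machine booted into the kernel under test.
    TESTS = {
        "smoke: kernel boots and reports a version": ["uname", "-r"],
        "placeholder: swap in a real test suite here": ["true"],
    }

    print(f"validating kernel {platform.release()}")
    failures = 0
    for name, cmd in TESTS.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        ok = result.returncode == 0
        if not ok:
            failures += 1
        print(f"[{'PASS' if ok else 'FAIL'}] {name}")
    raise SystemExit(1 if failures else 0)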


    Lecture 3 : Dependency Traps and Gotchas

    The speaker is from Google, where she is in charge of dependency
problems. She introduced two stories.

    # First Story : Server - Client Story

    The speaker showed the process of partitioning a system into many
micro-systems.

    In the very beginning, there is a client and a server.

    The server starts as one system -> the server is partitioned into a
Front End (FE), a Back End (BE), and a DB -> FE, BE, and DB each consist
of an encryption system and an ACL system -> .........

    Advantages:

        1. Good for replication and load balance
        2. Good for both horizontal and vertical scaling

    Disadvantage:

        It is very hard to understand the dependencies and to debug!

    We can use a token to track each request across RPCs (Remote
Procedure Calls); a small sketch of the idea follows the list below.
Dependency discovery (token tracking) can help with:

        1. Drawing performance profiles for subsystems
        2. Identifying bottlenecks
        3. Finding the most commonly exercised traffic paths
        4. Debugging
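
    Here is a tiny sketch of the token idea (Python, with made-up service
names; my own illustration, not Google's actual tracing system): each hop
appends its own name to a token that travels with the request, so the last
hop can reconstruct the full call path:

    def rpc(service, handler, payload, token):
        token = token + [service]        # the token travels with every request
        return handler(payload, token)

    def db(payload, token):
        print("call path:", " -> ".join(token))
        return {"rows": [], "trace": token}

    def backend(payload, token):
        return rpc("db", db, payload, token)

    def frontend(payload, token):
        return rpc("backend", backend, payload, token)

    rpc("frontend", frontend, {"query": "profile"}, token=[])
    # prints: call path: frontend -> backend -> db

    Collecting such tokens over real traffic is what produces the
profiles, bottlenecks, and common paths listed above.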

    Discovery alone is not enough, though; it does not help with design.


    # Second Story : Dependency Traps

    When we design a system, many teams are in charge of different
micro-systems.

    Assume a design:

        The storage team builds a database (DB) system,
        the privacy team builds an ACL (Access Control List) system,
        and the security team builds an encryption system.
        DB -> ACL & Encryption (cannot log in without being authorized by
the ACL, cannot connect without encryption)
        ACL -> DB & Encryption (the list is stored in the DB and encrypted)
        Encryption -> ACL & DB (the cipher is stored in the DB and cannot
be decrypted without authorization)
        "->" represents a dependency

    Each system depends on the other two. Each dependency looks fine on
its own, but put them together and a cycle appears, and then bad things
happen. Assume that we have Site A and Site B in such a design:

        1. Bootstrap dependency loop

        Assume site A works fine and we want to launch an identical new
site B.
        We can reuse A's encryption system and its public/private keys,
build B's DB, and then use A's ACL to set up B's ACL. In such a scenario,
we cannot bootstrap site B without an existing site A.

        2. Dependency lock

        Assume that both sites A and B hold the keys for their DBs. If A's
encryption system goes down, we can still use B's encryption system to
decrypt and access both DBs. However, if the encryption systems of both A
and B go down, we lose access to the databases of A and B.
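
    A cycle like this can be caught from the declared dependencies alone.
Here is a minimal sketch (Python, using the three systems from the example
above; again my own illustration, not the speaker's tooling):

    deps = {
        "db":         ["acl", "encryption"],
        "acl":        ["db", "encryption"],
        "encryption": ["acl", "db"],
    }

    def find_cycle(graph, node, path=()):
        """Depth-first walk; returns the first dependency cycle found, if any."""
        if node in path:
            return path[path.index(node):] + (node,)
        for dep in graph.get(node, ()):
            cycle = find_cycle(graph, dep, path + (node,))
            if cycle:
                return cycle
        return None

    print(find_cycle(deps, "db"))
    # ('db', 'acl', 'db') -- the design has a cycle before anything is deployed

    In practice the interesting split is between bootstrap and runtime
edges, which is exactly the first rule below.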

    Therefore, when designing dependencies, make sure to:

        1. Differentiate between bootstrap and runtime dependencies

        Even though everything seems to work fine at runtime, there may
still be bootstrap dependencies.

        2. Avoid local cycles
        3. Avoid dependencies that bootstrap remotely.



What I learned

    As I described in the summary, I learned:

    1. Some network topologies and approaches, along with their
advantages, disadvantages, and limitations.
    2. What G.I.F.E.E. is.
    3. Although I know almost nothing about kernel testing, I could still
pick up some testing methodologies and trains of thought.
    4. About many tools! A System Admin ought to know the tools and choose
the proper one among them.
    5. How to design a system that is easy to scale and maintain by
avoiding dependency traps.


That's all. Thank you for reading such a long report. I hope everyone who
reads it learns something, and you are welcome to start a discussion by
replying.


Best Regards,

Jinglong