[cs615asa] New York SRE Tech Talks Summary

Liang He lhe16 at stevens.edu
Mon May 1 12:47:51 EDT 2017


Hello everyone,



I want to share my meetup experience with you. I attended on Wednesday,
April 26, at Facebook's office.



The meetup covered three topics.


The first one was about Linux network switches. In this part, the speaker
introduced Google's Infrastructure for Everyone Else (GIFEE), a pioneering
container-based distributed infrastructure. It demonstrates both the value
and the methods of deploying and running large-scale distributed systems.

The speaker described the old network topology and its disadvantages. As we
know, the root-level switches form the core of the old network. If an
endpoint host wants to send a request to another endpoint host, its packets
may have to climb through the upper switches to the root level first, and
then pass through another root switch and many child switches to reach the
destination host. Cisco and Arista are well known for this kind of
multi-layer network switch. This complex process is invisible to the end
hosts.

The speaker then showed us a newer network topology, the Clos topology.

In this design, every lower-tier switch is connected to each of the
top-tier switches. The top-tier switches are called spine switches and the
lower-tier switches are called leaf switches. The advantage of a Clos
topology is that you can build it from a set of identical, inexpensive
devices and still get pretty good performance. The speaker said the Clos
topology uses eBGP.
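
To make the leaf/spine wiring concrete, here is a minimal Python sketch (my
own illustration, not from the talk). It assumes a two-tier fabric; the
switch names and the helper functions build_leaf_spine and paths_between
are made up for the example. It shows that any two leaf switches are
exactly two links apart, with one equal-length path per spine switch.

```python
from itertools import product


def build_leaf_spine(num_spines, num_leaves):
    """Build a two-tier Clos fabric: every leaf links to every spine."""
    spines = [f"spine{i}" for i in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    # Full mesh between the two tiers; no leaf-leaf or spine-spine links.
    links = set(product(leaves, spines))
    return spines, leaves, links


def paths_between(src_leaf, dst_leaf, spines, links):
    """List every leaf -> spine -> leaf path between two leaf switches."""
    return [
        (src_leaf, spine, dst_leaf)
        for spine in spines
        if (src_leaf, spine) in links and (dst_leaf, spine) in links
    ]


# Hypothetical sizes, chosen only for illustration.
spines, leaves, links = build_leaf_spine(num_spines=4, num_leaves=6)
paths = paths_between("leaf0", "leaf5", spines, links)

print(len(links))  # 24 links: 6 leaves x 4 spines
print(len(paths))  # 4 equal-cost paths, one through each spine
```

Adding a spine switch adds one more path between every pair of leaves,
which is why the fabric can grow using the same inexpensive devices.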



The second topic is about Linux Kernel in Production.

From the speaker I learned that stability issues, the velocity of change,
and security problems are the main problems the Linux kernel faces
nowadays. From this part, I roughly understood the process of testing the
Linux kernel after development. The speaker introduced many Linux testing
projects, such as xfstests, Flexible I/O Tester (fio), and kernel
selftests, along with some of the kernel test modules they provide, like
perf tests. To improve the kernel, we also need static analysis and
debugging tools, such as GCC warnings, KASAN, and Coccinelle. If you want a
perfect kernel, some extra tests are needed. At that point you need
additional tooling for network testing, replicating network load,
performance regressions, and testing for specific bugs.



The last topic is Dependency Traps.

From the client's point of view, using the Internet just means clicking a
button, and the remote servers return what the client wants. The remote
server looks like a single machine: you ask and it gives you an answer. In
reality it is more complicated. The server side consists of two parts: the
frontend and the backend. The frontend is the interface to clients and
usually provides the UI they interact with. The backend responds to
clients' requests. The backend has three main sections: the network access
control list (ACL), encryption, and databases. In general, the frontend
receives a client's request and delivers it to the backend, and the backend
passes it to the encryption section to decrypt its contents. Once it knows
the request content, the backend checks the ACL to find out which services
are allowed, then encrypts the request and sends it to the databases.

The speaker said many potential dependency problems can arise when multiple
servers are involved. One kind of trap is a data dependency lock. For
example, the encryption key for server A is stored in the database on
server B, and the encryption key for server B is stored in the database on
server A. In this situation, if a request is sent to server A, server A
needs to access database B to get key A. But what happens if a request
reaches server B at the same time and needs to access database A to get
key B? Both of them will keep waiting. This is called a lock, in other
words a deadlock.

Another kind of problem is called a bootstrap dependency loop. For example,
the backend needs to check the ACL, but the contents of the ACL are stored
in the database, and to access the database the backend has to send a
request to it. At that point, however, the backend is still waiting for
the result from the ACL. A dependency loop appears:
Backend-->ACL-->Database-->Backend. So the whole server keeps waiting.
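
Both traps come down to a cycle in the "needs" graph between components.
As a rough sketch (my own, not something the speaker showed), a
depth-first search can find such a cycle before deployment. The component
names and the find_cycle helper are hypothetical, and a real system would
have to discover its dependencies rather than have them written down in a
dictionary.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of nodes, or None if acyclic.

    deps maps each component to the components it must wait for.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in deps}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path: we closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# The bootstrap loop from the talk: Backend-->ACL-->Database-->Backend.
bootstrap = {"backend": ["acl"], "acl": ["database"], "database": ["backend"]}
# The key-storage trap: each server needs the other one to get its key.
keys = {"server_a": ["server_b"], "server_b": ["server_a"]}
# A layout without a loop, for comparison.
healthy = {"backend": ["acl", "database"], "acl": [], "database": []}

print(find_cycle(bootstrap))  # ['backend', 'acl', 'database', 'backend']
print(find_cycle(keys))       # ['server_a', 'server_b', 'server_a']
print(find_cycle(healthy))    # None
```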

The speaker also pointed out that simulating the network makes these kinds
of problems a little hard to test.



That is what I learned and thought about from this meetup. I may have
misunderstood some things because my knowledge is limited. I would be glad
to hear your thoughts and discuss them with you. Thanks for reading.


Link to the meetup: https://developers.google.com/events/sre/nyc#april_2017



Liang