[srcit_announce] linux-lab replacement system: lab.srcit.stevens.edu

SRCIT Staff srcit at stevens.edu
Tue Oct 4 11:07:40 EDT 2011


SRCIT Users,

After monitoring our current 'linux-lab' system, we have decided to implement
the system in a more robust way. The new deployment and intended replacement for
the current 'linux-lab' system can be accessed here:

    * lab.srcit.stevens-tech.edu
    * lab.cs.stevens-tech.edu

      or

    * lab.srcit.stevens.edu
    * lab.cs.stevens.edu


The new 'lab' deployment serves to solve the 3 most pertinent issues we noticed
with the current 'linux-lab' deployment:

    * confusion about SSH-key messages
    * connections to 'linux-lab' hanging
    * overloaded hosts while other hosts are underutilized

While there are known solutions to all of these issues, we understand that
people mostly just want to be able to get their work done without having to deal
with these issues on a daily basis.

Much like the current 'linux-lab' system, the new system can also be accessed by
individual hostname. The 8 new hosts in the 'lab.srcit.stevens-tech.edu' system
are:

    * avalon.srcit.stevens-tech.edu
    * avatar.srcit.stevens-tech.edu
    * rainman.srcit.stevens-tech.edu
    * gump.srcit.stevens-tech.edu

    * gits.srcit.stevens-tech.edu
    * eva.srcit.stevens-tech.edu
    * nemo.srcit.stevens-tech.edu
    * smurf.srcit.stevens-tech.edu

If you need to write networking code, you should make sure to use the specific
host name(s). Please see the first part of the section 'More Info and Known
Issues' below.

If you are measuring runtimes of algorithms / programs, please see the second
part of the 'More Info and Known Issues' section below.

Again, the host names are listed above. They are organized into one group of 4
live-action movies and one group of 4 animated movies to make the hostnames
easier to remember.

Please try out the new system and let us know about any issues you have with
the new installation.

Thanks,

     SRCIT Staff

________________________________________________________________________________
________________________________________________________________________________


[More Info and Known Issues]


Networking Code
---------------

One issue remains for users who use the system for writing networking code.
This is not a new issue: the same problem, with slightly different causes,
exists in the current 'linux-lab' deployment. When writing networking code,
please do not use the hostname 'lab.srcit.stevens-tech.edu' as the hostname to
run a server on. 'lab.srcit.stevens-tech.edu' is simply a name provided for
your convenience; it is actually just an SSH load balancer that forwards your
connection to one of the hosts listed above. Depending on current load, your
next connection may be forwarded to a different host. So if you start a server
that waits for connections on 'lab.srcit' and then try to connect your client
to 'lab.srcit', the client may reach a different host and fail to work as you
intended.

This same problem exists in the current 'linux-lab' deployment as the DNS Round
Robin scheme in place will also send each subsequent connection to a different
host in the list of current 'linux-lab' hosts. Thus, this is indeed not a new
issue for network coders, but we'd just like to remind people that this issue
still exists.
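
As a rough illustration of the point above, here is a minimal Python sketch in
which the server and the client are given the same specific host name. The
function names 'serve_once' and 'echo_roundtrip' are made up for this example,
and port 0 (let the kernel pick a free port) is an assumption made to keep the
sketch self-contained; a real server would normally use a fixed port.

```python
import socket
import threading


def serve_once(listener):
    """Accept a single connection and echo back whatever the client sends."""
    conn, _addr = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


def echo_roundtrip(host, message=b"ping"):
    """Start a one-shot echo server on `host`, then connect a client to that same host."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.settimeout(5)        # do not wait forever if no client arrives
    listener.bind((host, 0))      # port 0: the kernel picks any free port
    listener.listen(1)
    port = listener.getsockname()[1]

    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()
    try:
        # The client uses the SAME host name the server was bound to,
        # not a load-balanced alias that may resolve to another machine.
        with socket.create_connection((host, port), timeout=5) as client:
            client.sendall(message)
            reply = client.recv(1024)
    finally:
        server.join()
        listener.close()
    return reply
```

When logged in to one of the individual hosts, you would pass that host's own
name, for example echo_roundtrip('avalon.srcit.stevens-tech.edu'), rather than
'lab.srcit.stevens-tech.edu'.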


Algorithms Analysis Code and timing measurements
------------------------------------------------

The 'lab.srcit' system consists of 2 physical machines, each virtualized into 4
virtual machines. Each virtual machine has its own CPU and, unlike the current
'linux-lab', every host has exactly the same configuration.

However, due to the sharing of physical memory between hosts, there is a chance
that one virtual host will hog a lot of memory and thus slow down the other
hosts in a way that is strange and difficult for a non-root user to diagnose.

With this in mind, please time your algorithms correctly. That is, DO NOT use
'wall-clock' time to estimate the running time of your algorithms. ONLY use
metrics obtained from the timing mechanisms that the Linux kernel itself keeps
track of (such as user and system CPU time). There are many ways to measure the
time spent executing; please keep in mind that 'time' may be a shell-builtin
command, while '/usr/bin/time' is a separate executable.
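
To make the difference concrete, here is a small Python sketch that reports
both CPU time and wall-clock time for one function call. The names 'measure'
and 'busy_sum' are made up for this example; 'time.process_time()' and
'time.perf_counter()' are standard-library calls. This is only one possible way
to take such a measurement, not a required method.

```python
import time


def busy_sum(n):
    """A stand-in workload: sum of squares computed in a plain loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def measure(func, *args):
    """Call func(*args) once and return (result, cpu_seconds, wall_seconds)."""
    wall_start = time.perf_counter()   # wall-clock: includes waiting and sleeping
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, cpu, wall
```

Calling measure(time.sleep, 0.2) should report a wall-clock time of about 0.2
seconds but a CPU time close to zero, because a sleeping process is not
executing. On a loaded host the wall-clock figure for measure(busy_sum, 200000)
may grow for the same reason, while the CPU figure is the one to report.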

Please read the man page for 'time' in section 7 of the manual, which you can
access by entering the following command at the shell prompt:
    bash$ man -s 7 time

To get a handle on the mechanisms available to a programmer for measuring
running times, more information can be found with this command:

    bash$ man -k time | egrep -e '^\<times?\>'

as well as this similar command:

    bash$ man -k time | egrep -e '^\<times?'

Also please note the special case that is documented in the time(1) manual page
in the section "ACCURACY", which states as follows:

--------------------8<-------------------8<-------------------8<----------------
    ACCURACY

           The  elapsed time is not collected atomically with the execution of
           the program; as a result, in bizarre circumstances (if the time
           command gets stopped or swapped out in between when the program being
           timed exits and when time calculates how long it took to run), it
           could be much larger than the actual execution time.

           When the running time of a command is very nearly zero, some values
           (e.g., the percentage of CPU used) may be reported as either zero
           (which is wrong) or a question mark.

           Most information shown by time is derived from the wait3(2) system
           call.  The numbers are only as good as those returned by wait3(2).
           On systems that do not have a  wait3(2)  call  that returns status
           information, the times(2) system call is used instead.  However, it
           provides much less information than wait3(2), so on those systems
           time reports the majority of the resources as zero.

           The  `%I'  and `%O' values are allegedly only `real' input and output
           and do not include those supplied by caching devices.  The meaning of
           `real' I/O reported by `%I' and `%O' may be muddled for workstations,
           especially diskless ones.

--------------------8<-------------------8<-------------------8<----------------
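
The kernel's per-child resource accounting can also be read from your own
program. The following Python sketch runs a command and reports the user and
system CPU seconds charged to it, using 'resource.getrusage' from the standard
library. The function name 'child_cpu_seconds' is made up for this example, and
the sketch assumes that no other child process finishes while the command is
running, since the counters cover all children that have been waited for.

```python
import resource
import subprocess
import sys


def child_cpu_seconds(argv):
    """Run argv to completion and return (user, system) CPU seconds it used."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(argv, check=True)   # waits for the child to exit
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # The counters are cumulative over all waited-for children, so take the difference.
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return user, system
```

For example, child_cpu_seconds([sys.executable, '-c', 'sum(range(10**6))'])
returns the CPU time of a short Python child process, independent of how long
it spent waiting on a busy host.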

In short, it is your responsibility to make sure that your timing measurements
are taken with the correct tools. However, if you suspect a problem with the
system due to the virtualization of the hosts, please file a helpdesk ticket
with Request Type 'SRCIT' and we'll look into it.


[EOF]

