[cs615asa] HW5

Jan Schaumann jschauma at stevens.edu
Mon Apr 17 16:33:53 EDT 2017


Hello,

I'll be sending out grades for HW5 in a few minutes.  As discussed,
there are of course many different ways of finding an answer to the
questions.  What's interesting is that many of you not only found
different methods, but ended up with different results.

This is in part due to the ambiguity in the definition, but it's also
due to pitfalls in the data set and in how you use the tools.  For
example, awk(1) with its default field separator treats any run of
whitespace as a single separator, while cut(1) with a space delimiter
treats each space individually and thus produces empty fields.  So on
lines that have an empty field, "$3" in awk may not be the column you
expect.
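
To illustrate, here's a made-up line with an empty title field (two
spaces between the domain and the count); the difference between the
two tools is easy to see:

$ echo 'en  5 1000' | awk '{print $3}'
1000
$ echo 'en  5 1000' | cut -d' ' -f3
5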

This is a good exercise to show how you have to carefully review your
solution and possibly double-check using a different approach: if two
methods yield different results, something's askew.

Here are the results I came up with:

0) Let's save some typing:

export FILE=

1) Just count the number of lines, as per the data definition:

$ gzcat $FILE | wc -l
 9771932
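
As a quick cross-check of the kind mentioned above, counting records in
awk should yield the same number (assuming the file ends in a newline):

$ gzcat $FILE | awk 'END { print NR; }'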

2) Either use "en " (thereby strictly looking for the "en" domain), or
accept all "en.XYZ" domains but ignore the ".mw" aggregation lines, per:
https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-raw#Aggregation_for_.mw

$ gzcat $FILE | grep "^en " | wc -l
 2167954
$ gzcat $FILE | egrep "^en[ |.]" | grep -v ".mw" | wc -l
 3346029

(In the following, I'll simply include all the ".mw" data.)
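
If you're not sure which "en" projects even show up in the file, a
quick breakdown of the first column can help you decide what to
include; this is just an exploratory sketch, and the counts will of
course depend on the data:

$ gzcat $FILE | awk '$1 ~ /^en(\.|$)/ { print $1; }' | \
        sort | uniq -c | sort -rn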

3) As above for the language/domain distinction, then find the largest
third field.  Caveat: if the title is empty, awk's field splitting
shifts everything left, so the "third" field actually ends up being the
second.  Use the next-to-last field, $(NF-1), instead:

$ gzcat $FILE | egrep "^de " | awk '{print $(NF-1) " " $2;}' | \
        sort -n | tail -1 | awk '{print $NF}'
Wikipedia:Hauptseite
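
You can get the same answer in a single awk pass by tracking the
maximum yourself instead of sorting; a sketch, with the same
empty-title caveat as above:

$ gzcat $FILE | \
        awk '/^de / && $(NF-1) > max { max = $(NF-1); title = $2; } END { print title; }'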

4) Sum up the count for each object (i.e. the next-to-last field), then
divide by the number of seconds in an hour:

$ gzcat $FILE | awk '{ sum+=$(NF-1); } END { print sum/3600;}'
9999.72
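
If you also want to see the raw total next to the per-second rate, a
small variation of the same pipeline will do (a sketch):

$ gzcat $FILE | \
        awk '{ sum+=$(NF-1); } END { printf "%d requests, %.2f/s\n", sum, sum/3600; }'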

5) Sum up the last field; divide by 1024 three times to convert bytes
to GB:

$ gzcat $FILE | awk '{ sum+=$NF; } END { print sum / 1024 / 1024 / 1024; }'
745.65
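
If you have GNU coreutils installed, numfmt(1) can do the unit
conversion for you (it's not part of the BSD base system, so treat this
as an alternative sketch):

$ gzcat $FILE | awk '{ sum+=$NF; } END { printf "%.0f\n", sum; }' | \
        numfmt --to=iec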

6) For each "fr" object, divide the total response size (last field) by
the request count to approximate the size of each object, then sort:

$ gzcat $FILE | awk '/^fr / { print $NF / $(NF - 1) " " $2; }' | \
        sort -n | tail -1 | awk '{print $NF}'
Projet:Palette/Maintenance/Listes

7) One example was shown in class / lecture 01 slides.  Just splitting
on whitespace is a bit lazy, since that yields a giant html/css/js
block as a single "word", which it clearly is not.

I hope you play around a bit with this data set and the tools to see
what other statistics you could extract, and to get more comfortable
with these little command-line tools.

-Jan

