[cs615asa] Wrong Answers

Patrick Grasso pgrasso at stevens.edu
Fri Apr 6 17:15:07 EDT 2018


I'm in the same position as you both.
I tried (after filtering the en domain)
- total rows
- unique (domain, page title) pairs
- unique page title
- total page views (in case that would constitute a "unique object")
- with/without en.mw (since it's an aggregation for all mobile pages)
- with/without page_title = "-" (there's a remark on this entry on the page
describing the dataset)
- with/without page_title = "" (I believe there's two of these. Its either
two of "" or two of "-")

None of these returned the correct answer for (1). I also tried a scripted
brute-force approach by submitting ranges of answers around what I thought
to be the correct answer, but didn't find a solution.

I have not tried converting url-encoded hex characters (e.g. %2e) before
applying a unique filter, as I assumed those would be converted by either
the browser or the server and also considered this level of scrutiny to be,
as the professor said, thinking too deep.
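
If anyone does want to test the URL-decoding idea, here is a rough sketch
(mine, untested against the dataset) in plain POSIX awk: the %XX lookup
table is built in BEGIN, and each escape in the title column is replaced
before uniquing. It naively re-scans after each substitution, so a decoded
"%" followed by two hex digits would get decoded again; good enough for a
quick count:

```shell
gzcat data.gz | grep "^en[\. ]" |
  awk 'BEGIN { for (i = 1; i < 256; i++) chr[sprintf("%02x", i)] = sprintf("%c", i) }
  {
    t = $2
    # Replace each %XX escape with its decoded character.
    while (match(t, /%[0-9A-Fa-f][0-9A-Fa-f]/))
      t = substr(t, 1, RSTART - 1) chr[tolower(substr(t, RSTART + 1, 2))] substr(t, RSTART + 3)
    print t
  }' | sort -u | wc -l
```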

On Fri, Apr 6, 2018, 4:45 PM Matthew Mahoney <mmahone1 at stevens.edu> wrote:

> Alrighty, so I just got problems 3 and 4 to verify; however, I’m not
> entirely sure that the accepted answers are actually correct.
>
> There is a single entry in the data set which is “en  2166 17078204”.
>  (Note the two spaces after en as opposed to one)
>
> When I get the answers the form accepts using awk, I index the columns
> using $4 and $3 for problems 3 and 4 respectively. However, awk’s default
> field splitting collapses the double space, so for that one problematic
> entry 17078204 ends up in $3 instead of $4, where it belongs.
>
> When I instead use $(NF) and $(NF-1), or explicitly set space as the
> delimiter using awk’s -F argument, I get a different answer from the
> accepted one (though the same answer from both alternate methods). I
> believe that answer might actually be the correct one, since those
> approaches keep the columns lined up for that one data entry.
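
(A note on the -F point: POSIX awk treats FS=" ", a single space, as the
same collapse-runs-of-whitespace default, so -F' ' by itself shouldn't
change the splitting; a bracket expression like -F'[ ]' does split on
every single space. A quick demonstration on that one entry:)

```shell
line='en  2166 17078204'   # note: two spaces after "en"

# Default FS collapses whitespace runs: only 3 fields, bytes land in $3.
printf '%s\n' "$line" | awk '{ print NF, $3 }'          # -> 3 17078204

# -F'[ ]' splits on every single space: empty $2, 4 fields.
printf '%s\n' "$line" | awk -F'[ ]' '{ print NF, $4 }'  # -> 4 17078204

# $(NF) and $(NF-1) stay aligned under the default splitting.
printf '%s\n' "$line" | awk '{ print $(NF-1), $NF }'    # -> 2166 17078204
```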
>
> Admittedly, I could be wrong; this is just an observation I made. If
> someone could check it for me, that would be greatly appreciated.
>
> Also, I’m equally stuck on problems 1 and 2. I got the same answer as
> Jason for problem 1 if I include en.mw, and that value minus 1 if I do
> not.
>
> Regards,
> Matthew Mahoney
>
>
>
> Sent from my iPhone
> On Apr 6, 2018, at 3:57 PM, Jason Ajmo <jajmo at stevens.edu> wrote:
>
> Matt,
>
> That makes sense; however, a case-insensitive, unique sort on column 2
> still yields an incorrect answer:
> # gzcat data.gz | grep "^en[\. ]" | sort -ufk 2 | wc -l
>  2232660
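
(One thing I noticed about that sort invocation: -k 2 defines a key running
from field 2 to the end of the line, so -u deduplicates on the title plus
both count columns. Keying on the title alone takes -k 2,2. A small
illustration with made-up rows, not real dataset entries:)

```shell
data='en Main_Page 5 100
en main_page 3 90
en Other 1 10'

# -k 2 keys on fields 2..end of line: the counts differ, nothing collapses.
printf '%s\n' "$data" | sort -uf -k 2 | wc -l     # 3 lines survive

# -k 2,2 keys on field 2 only: Main_Page and main_page fold together.
printf '%s\n' "$data" | sort -uf -k 2,2 | wc -l   # 2 lines survive
```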
>
>
> On Fri, Apr 6, 2018 at 3:49 PM Matthew Gomez <mgomez1 at stevens.edu> wrote:
>
>> I read the question as “the second column must be unique,” and there are
>> entries like Main_page and Main_Page that I assume refer to the same
>> object, so I was uniq’ing on the second column.
>>
>> Matt
>>
>> On Fri, Apr 6, 2018 at 3:43 PM Jason Ajmo <jajmo at stevens.edu> wrote:
>>
>>> I've gotten 3-5, but I'm struggling a little on 1 and 2. I've tried many
>>> different variations, but can't seem to get the right answer. Here are my
>>> two that I feel are "most" correct.
>>>
>>> 1:
>>> # gzcat data.gz | grep "^en[\. ]" | wc -l
>>>  2233318
>>> I didn't feel that sorting or `uniq`ing was necessary, since each row
>>> should be unique as it is.
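
(That assumption is quick to check: uniq -d prints only duplicated lines,
so an empty result means every filtered row really is unique. Against the
real data that would be gzcat data.gz | grep "^en[\. ]" | sort | uniq -d;
here on toy input:)

```shell
# Only the duplicated line is printed.
printf 'en A 1 1\nen B 2 2\nen A 1 1\n' | sort | uniq -d   # -> en A 1 1
```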
>>>
>>> 2:
>>> # gzcat data.gz | grep "^en[\. ]" | awk '{ print $2 " " $(NF - 1) }' |
>>> sort -nrk 2 | head -n 1
>>> en 3127515
>>> For this one, I had to do a little data transformation with awk, since
>>> using sort with -k 3 and no awk gave clearly incorrect results.
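
(Part of that is likely an open-ended key: -k 3 compares from field 3
through the end of the line, and the double-space entry shifts its fields.
Bounding the key and letting awk realign via $(NF-1) sidesteps both;
shown here on made-up rows:)

```shell
# Toy stand-ins for "domain title views bytes" rows.
printf 'en A 10 100\nen B 3127515 900\nen C 5 50\n' |
  awk '{ print $2, $(NF - 1) }' |   # title and view count, realigned
  sort -k 2,2nr | head -n 1         # -> B 3127515
```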
>>>
>>> The initial gzcat and grep are correct, since it's the foundation I used
>>> for 3-5. Any feedback on my statements above would be appreciated.
>>>
>>> Thanks.
>>> --
>>> Jason Ajmo
>>> Stevens Institute of Technology
>>> B.S. Cybersecurity '17
>>> M.S. Computer Science '18
>>> 0x56FA3123
>>>
>>> _______________________________________________
>>> cs615asa mailing list
>>> cs615asa at lists.stevens.edu
>>> https://lists.stevens.edu/mailman/listinfo/cs615asa
>>>
> --
> Jason Ajmo
> Stevens Institute of Technology
> B.S. Cybersecurity '17
> M.S. Computer Science '18
> 0x56FA3123
>


More information about the cs615asa mailing list