[cs615asa] Wrong Answers

Fri Apr 6 16:45:36 EDT 2018

Alrighty, so I just got problems 3 and 4 to verify; however, I’m not entirely sure the accepted answers are actually correct.

There is a single entry in the data set that reads “en  2166 17078204”. (Note the two spaces after en, as opposed to one.)

To get the answers the form accepts, I index the columns in awk with $4 and $3 for problems 3 and 4 respectively. However, unless you explicitly set space as the delimiter with awk’s -F argument, awk collapses the two spaces into a single separator, so for that one problematic entry 17078204 ends up in $3 rather than in $4, where it belongs.

When I instead use $(NF) and $(NF-1), or explicitly set space as the delimiter with awk’s -F argument, I get an answer that differs from the accepted one. The two alternate methods agree with each other, and I believe their answer might actually be the correct one, since either approach keeps the columns lined up for that one data entry.
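
For anyone who wants to reproduce this, here is a minimal sketch of the three ways of splitting that one entry. The function names are made up for illustration. One assumption to flag: POSIX awk treats a -F value of exactly one space the same as the default (runs of whitespace are collapsed), so the sketch uses the bracket expression '[ ]' to force a split on every single space.

```shell
# The problematic entry: two spaces after "en", i.e. an empty second column.
line='en  2166 17078204'

# Default splitting: runs of whitespace collapse into one separator,
# so the entry has only 3 fields and $4 is empty.
default_split() {
    printf '%s\n' "$1" | awk '{ print NF ": $3=" $3 " $4=" $4 }'
}

# Splitting on every single space (assumption: '[ ]' as the separator):
# the empty second column is kept and the entry has 4 fields.
single_space_split() {
    printf '%s\n' "$1" | awk -F'[ ]' '{ print NF ": $3=" $3 " $4=" $4 }'
}

# Counting from the end of the line works with either kind of splitting.
from_end() {
    printf '%s\n' "$1" | awk '{ print $(NF-1), $NF }'
}

default_split "$line"        # 3: $3=17078204 $4=
single_space_split "$line"   # 4: $3=2166 $4=17078204
from_end "$line"             # 2166 17078204
```

Only the default split puts 17078204 in $3; the other two keep it in the last column.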

I could be wrong; this is just an observation I made. If someone could check that for me, it would be greatly appreciated.

Also, I’m equally stuck on problems 1 and 2. For problem 1, I get the same answer as Jason if I include en.mw, and that value minus 1 if I do not include en.mw.

Regards,
Matthew Mahoney

> On Apr 6, 2018, at 3:57 PM, Jason Ajmo <jajmo at stevens.edu> wrote:
> 
> Matt,
> 
> That makes sense; however, a case-insensitive, unique sort on column 2 still yields an incorrect answer:
> # gzcat data.gz | grep "^en[\. ]" | sort -ufk 2 | wc -l
>  2232660
> 
> 
>> On Fri, Apr 6, 2018 at 3:49 PM Matthew Gomez <mgomez1 at stevens.edu> wrote:
>> I read the question as “the second column must be unique” and there are things like Main_page and Main_Page which I assume are the same object. So I was uniqing on the second column. 
>> 
>> Matt
>> 
>>> On Fri, Apr 6, 2018 at 3:43 PM Jason Ajmo <jajmo at stevens.edu> wrote:
>> 
>>> I've gotten 3-5, but I'm struggling a little on 1 and 2. I've tried many different variations, but can't seem to get the right answer. Here are my two that I feel are "most" correct.
>>> 
>>> 1:
>>> # gzcat data.gz | grep "^en[\. ]" | wc -l
>>>  2233318
>>> I didn't feel that sorting or `uniq`ing was necessary, since each row should be unique as it is.
>>> 
>>> 2: 
>>> # gzcat data.gz | grep "^en[\. ]" | awk '{ print $2 " " $(NF - 1) }' | sort -nrk 2 | head -n 1
>>> en 3127515
>>> For this one, I had to do a little data transformation with awk since using sort with -k 3 and no awk was giving clearly incorrect results.
>>> 
>>> The initial gzcat and grep are correct, since it's the foundation I used for 3-5. Any feedback on my statements above would be appreciated.
>>> 
>>> Thanks.
>>> -- 
>>> Jason Ajmo
>>> Stevens Institute of Technology
>>> B.S. Cybersecurity '17
>>> M.S. Computer Science '18
>>> 0x56FA3123
>> 
>>> _______________________________________________
>>> cs615asa mailing list
>>> cs615asa at lists.stevens.edu
>>> https://lists.stevens.edu/mailman/listinfo/cs615asa