
  • Practice Exam Question Confusion

  • JimA

    Member
    September 19, 2022 at 6:37 pm

    I missed an exam question, and I understand why the preferred answers are what they are, but the following explanation for one of the wrong answers seems very misleading. The question:

    A Machine Learning Specialist is developing a Natural Language Processing (NLP) application that processes a large set of collated social media posts. The Specialist’s objective is to run Word2Vec to create embeddings of these posts to make various types of predictions. Which combination of preprocessing techniques should be done to clean the data in a scalable manner?

    One of the wrong answers is: “Use one-hot encoding to all words in a post.” The explanation for why this is wrong is “one-hot encoding is not suitable for word2vec as it does a poor job of capturing semantics between words. A better approach is to use tokenization to split the post into specific words that can be processed”.

    I understand why this is a wrong answer, but this explanation really should be changed. I have implemented word2vec myself, and one of the steps necessary to use it is to first convert the words into a one-hot encoded form. And it’s not just me….

    This is stated in the original word2vec paper (https://arxiv.org/pdf/1301.3781.pdf), which describes both CBOW and skip-gram as being derived from feedforward NNLM, and the paragraph describing NNLM says that in NNLM, “words are encoded using 1-of-V coding”. This is the same thing as one-hot encoding.

    So to say “one-hot encoding is not suitable for word2vec” is like saying “opening your mouth is not suitable for eating”. The first one is a necessary step to accomplish the second.
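
    A minimal sketch of that relationship, using only the Python standard library. It assumes a toy two-post corpus, a deliberately naive tokenizer, and a randomly initialised input weight matrix; the helper names (`tokenize`, `one_hot`, `vec_mat`) and the sample posts are made up for illustration and are not taken from the paper or from any library. It shows that the posts are tokenized first, each word is then represented as a 1-of-V vector, and multiplying that vector by the input weight matrix simply selects the row that becomes the word's embedding.

```python
import random
import re


def tokenize(post):
    """Lowercase a post and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", post.lower())


def one_hot(index, size):
    """Return a 1-of-V vector: all zeros except a 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, matrix):
    """Multiply a row vector of length V by a V x D matrix."""
    cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(cols)]


# Toy data (hypothetical, not from the exam question).
posts = ["I love this phone", "I hate this phone"]

# Step 1: tokenization, the preprocessing step the exam explanation recommends.
tokens = [tok for post in posts for tok in tokenize(post)]

# Step 2: build a vocabulary that maps each distinct word to an index.
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}

# Step 3: a V x D input weight matrix with random values (D chosen arbitrarily).
rng = random.Random(0)
dim = 3
W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

# Step 4: encode one word as 1-of-V and project it through W.
word = "love"
x = one_hot(vocab[word], len(vocab))
hidden = vec_mat(x, W)

# The projection is exactly the row of W that belongs to the word.
assert hidden == W[vocab[word]]
```

    In practice, implementations usually skip the explicit multiplication and index the row directly, which is why the one-hot step is easy to overlook even though it is what the lookup represents.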

    The explanation could be changed to say something like “using one-hot encoding is not a data cleaning step”. This would be true, and would technically make it a wrong answer, but I think the best path forward is probably just to remove the one-hot encoding answer altogether and replace it with something else.

    • This discussion was modified 1 year, 6 months ago by  JimA.
    • This discussion was modified 1 year, 2 months ago by  Tutorials-Dojo.