-
Incorrect Explanation of tf-idf
Nikee-TutorialsDojo updated 6 months, 3 weeks ago · 2 Members · 2 Posts
-
The explanation of tf-idf given in the question below is incorrect. Though the answer choice itself is reasonable, the descriptions of both term frequency and inverse document frequency are wrong.
Question:
A Machine Learning Specialist is building a project that runs sentiment analysis for product reviews. Because the validation accuracy is not satisfactory, the Specialist believes that a rich vocabulary plus a low average frequency of words in the training data is causing the issue.
How can the Specialist improve the validation accuracy of the model?
Provided Explanation (incorrect):
The first function (Term Frequency) counts how frequently a word appears in a sentence belonging to a corpus. The second function (Inverse Document Frequency) counts how frequently a word appears in the whole corpus.
Correct definition:
Term frequency is related to the number of times that a term appears in a given document, not a sentence and not across the entire corpus (collection of documents). Inverse document frequency is derived from the proportion of documents in the corpus that contain a word, not how frequently a word appears in the whole corpus overall.
The actual formula is somewhat more involved than the simplified definitions above, but the provided explanation is wrong on both counts. Source: http://i.stanford.edu/~ullman/mmds/ch1.pdf, p. 8.
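For reference, one standard formulation (following the cited Stanford MMDS notes; the symbols here are assumptions for illustration: f_ij is the count of term i in document j, N is the number of documents in the corpus, and n_i is the number of documents containing term i):

```latex
\mathrm{TF}_{ij} = \frac{f_{ij}}{\max_k f_{kj}}, \qquad
\mathrm{IDF}_i = \log_2 \frac{N}{n_i}, \qquad
\mathrm{TF\text{-}IDF}_{ij} = \mathrm{TF}_{ij} \times \mathrm{IDF}_i
```

Note that IDF depends only on how many documents contain the term, not on how many times the term occurs across the corpus, which is exactly the distinction the provided explanation misses.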
-
Hello zzzz,
Thank you for pointing this out, and we sincerely apologize for the confusion this may have caused. We will be updating the explanation to reflect the correct definitions.
Term Frequency – Inverse Document Frequency (TF-IDF) is a way to turn text into numerical features for machine learning models. Term Frequency (TF) measures how often a word appears in a document, usually divided by the total number of words. Inverse Document Frequency (IDF) measures how rare or common a word is across all documents in the dataset, giving lower scores to words that appear in many documents and higher scores to words that appear in fewer documents. Multiplying TF by IDF gives a score highlighting words that are frequent in one document but uncommon in the whole collection.
In this scenario, using Scikit-learn’s TfidfVectorizer helps reduce the weight of very common words while giving more importance to distinctive words, which can improve model accuracy.
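As a minimal sketch of what this looks like in practice (the sample reviews below are invented for illustration):

```python
# Sketch: TF-IDF vectorization of short product reviews using
# scikit-learn's TfidfVectorizer. The reviews are made-up examples.
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "great product great value",
    "terrible product would not buy again",
    "great quality but shipping was slow",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(reviews)  # sparse (3 x vocab) matrix

vocab = vectorizer.vocabulary_  # maps word -> column index
idf = vectorizer.idf_           # one IDF weight per vocabulary word

# "product" appears in two of the three documents, while "terrible"
# appears in only one, so "terrible" gets the higher IDF weight.
print(idf[vocab["product"]] < idf[vocab["terrible"]])  # prints True
```

The resulting matrix can be fed directly to a classifier such as logistic regression; because common words are down-weighted by IDF, the distinctive sentiment-bearing words carry more of the signal.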
If you notice anything or need additional assistance, please feel free to reach out to us.
Regards,
Nikee @ Tutorials Dojo