Unsure of a question in ML exam (changing the cost function)

Tagged: machine-learning

Carlo-TutorialsDojo updated 3 years ago 2 Members · 4 Posts
AWS Certified Machine Learning – Specialty
goking

Member
April 7, 2021 at 11:34 am

The question is :

A company is hosting a free-to-play online game with over a million active users. The game profits by inducing players to spend money on buying loot boxes. A Machine Learning Specialist uses data from 500,000 random users to train an XGBoost model that predicts players who are likely to buy at least 5 boxes within a month based on age, gender, playing hours, engagement patterns, etc. The collected data contains 150,000 positive samples and 350,000 negative samples. The model has high accuracy on the training dataset but low on the test dataset.

Which methods could the Specialist use to rectify the problem? (Select TWO.)

Increase the maximum depth of a tree.

Choose random samples of the training data and copy them to the test data.

Copy a subset of the positive samples and add noise to the copied data.

Tweak the cost function in such a way that the impact of false negatives on cost value is higher than false positives.

Tweak the cost function in such a way that the impact of false positives on cost value is higher than false negatives.

While “Copy a subset of the positive samples and add noise to the copied data” is obviously correct, the answer key also picks “Tweak the cost function in such a way that the impact of false positives on cost value is higher than false negatives.” Since the data has far more negative samples than positive ones, the model can simply predict almost all of them to be negative. So what we should minimize is the false negatives.

The explanation given is: “False positives in this scenario pertain to players who were predicted to buy at least 5 loot boxes but did not do so in reality. The company should be concerned more about Precision as it has more weight in terms of cost value.” That doesn’t make sense to me.
Carlo-TutorialsDojo

Member
April 8, 2021 at 2:08 am
Hello goking,

Thanks for sharing your feedback.

This item is a case of recall vs. precision. The formula for precision is TP / (TP + FP), where TP is True Positive and FP is False Positive. The formula for recall is TP / (TP + FN), where FN is False Negative. Since precision accounts for false positives, it should carry more weight in the cost function. False negatives in this scenario simply refer to players who were not predicted to buy at least 5 loot boxes but ended up buying them anyway. There’s no loss in false negatives, as the company still gains money from false negative predictions. Hence, precision is more important.

Let me know if this makes sense.

Regards,

Carlo @ Tutorials Dojo
- This reply was modified 3 years ago by Carlo-TutorialsDojo.
goking

Member
April 8, 2021 at 2:29 am

“There’s no loss in false negatives, as the company still gains money from false negative predictions.”

That is true, but the question isn’t formulated like that: “The model has high accuracy on the training dataset but low on the test dataset.” So I will give a more extreme example.

Take 9990 negative and 10 positive samples. The model can simply predict all to be negative (99.9% accurate) and have 10 false negatives. But if the test data is 5000 positive and 5000 negative, the accuracy will fall to 50%. In that case, we should minimize FN and should be more tolerant of false positives. (Even if it guessed positive and it is not, it is OK, because real data might contain many more positives.)
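The arithmetic in this extreme example can be checked with a short sketch. The helper names `accuracy` and `predict_all_negative` are made up for illustration, and the "model" is just a stand-in that always outputs the negative class, not a trained XGBoost model.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def predict_all_negative(n):
    """Hypothetical degenerate model: predicts the negative class (0) for every sample."""
    return [0] * n


# Imbalanced training set from the example: 10 positives, 9990 negatives.
train_labels = [1] * 10 + [0] * 9990
# Balanced test set from the example: 5000 positives, 5000 negatives.
test_labels = [1] * 5000 + [0] * 5000

train_acc = accuracy(train_labels, predict_all_negative(len(train_labels)))
test_acc = accuracy(test_labels, predict_all_negative(len(test_labels)))

# Every positive in the training set is missed, so all 10 are false negatives.
train_false_negatives = sum(t == 1 for t in train_labels)

print(train_acc, test_acc, train_false_negatives)  # 0.999 0.5 10
```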

“Tweak the cost function in such a way that the impact of false negatives on cost value is higher than false positives.” I understand this as minimizing the false negatives.

The current example is not so extreme, but it clearly has many more negatives than positives.
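For readers wondering what "tweaking the cost function" could look like in practice, here is a minimal sketch of a class-weighted log loss. The function name and the `fn_weight` / `fp_weight` parameters are assumptions made for this illustration, not part of the exam question or of any library. In XGBoost, the `scale_pos_weight` parameter plays a comparable role by up-weighting positive samples.

```python
import math


def weighted_log_loss(y_true, p_pred, fn_weight=1.0, fp_weight=1.0):
    """Binary log loss with separate weights per class (illustrative sketch).

    fn_weight scales the loss on actual positives (penalizes missed buyers).
    fp_weight scales the loss on actual negatives (penalizes false alarms).
    """
    eps = 1e-12  # keep log() away from 0
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        if y == 1:
            total += -fn_weight * math.log(p)
        else:
            total += -fp_weight * math.log(1 - p)
    return total / len(y_true)


# Two hypothetical single-sample cases.
missed_buyer = ([1], [0.1])  # actual buyer scored 0.1: a false-negative-style error
false_alarm = ([0], [0.9])   # non-buyer scored 0.9: a false-positive-style error

base_miss = weighted_log_loss(*missed_buyer)
heavy_miss = weighted_log_loss(*missed_buyer, fn_weight=3.0)
base_alarm = weighted_log_loss(*false_alarm)
heavy_alarm = weighted_log_loss(*false_alarm, fn_weight=3.0)

# Raising fn_weight triples the penalty for the missed buyer
# and leaves the false alarm's penalty unchanged.
print(heavy_miss / base_miss, heavy_alarm / base_alarm)
```

Swapping the roles (raising `fp_weight` instead) gives the other answer option, where false positives weigh more than false negatives.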
- Carlo-TutorialsDojo
  
  Member
  April 9, 2021 at 2:17 am
  
  “The model can simply predict all to be negative (99.9% accurate) and have 10 false negatives.”
  
  >> I understand that you’ve put it in simple terms, but for the sake of others, I’d say that the predictions will favor the majority class but will not necessarily fall only into True Negatives and False Negatives.
  
  “But if the test data is 5000 positive and 5000 negative, the accuracy will fall to 50%. In that case, we should minimize FN and should be more tolerant of false positives. (Even if it guessed positive and it is not, it is OK, because real data might contain many more positives.)”
  
  >> Both FN and FP reduce the accuracy of the model. In this case, we won’t know how many are FN or FP unless we have the prediction results. But for the sake of illustration, let’s consider the following prediction results:
  
  TN = 400,000
  
  TP = 75,000
  
  FP = 10,000
  
  FN = 15,000
  
  Let’s say that for every false positive, the company loses $5. The company does not lose anything for a false negative. We shouldn’t be more tolerant of false positives because they have more impact in terms of cost.
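The hypothetical counts in the last reply can be turned into numbers with a short sketch. The counts and the $5 loss per false positive are the reply's illustrative figures. They are not meant to match the 150,000 / 350,000 split in the question, since TP + FN here is 90,000. The variable names are made up for this example.

```python
# Illustrative confusion-matrix counts from the reply above.
tn, tp, fp, fn = 400_000, 75_000, 10_000, 15_000

# Assumed per-error costs from the reply: $5 per false positive, $0 per false negative.
cost_per_fp = 5
cost_per_fn = 0

total = tn + tp + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # TP / (TP + FP)
recall = tp / (tp + fn)     # TP / (TP + FN)
total_cost = fp * cost_per_fp + fn * cost_per_fn

print(f"samples={total} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} recall={recall:.3f} cost=${total_cost}")
```

Under these assumed costs, only false positives contribute to the total, which is the reasoning behind weighting them more heavily. With a nonzero `cost_per_fn`, the balance could shift.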
