Scaling up in splitprint

8/17/2023

In this project, I will employ several supervised algorithms of my choice to accurately model individuals' income using data collected from the 1994 U.S. Census. I will then choose the best candidate algorithm from preliminary results and further optimize this algorithm to best model the data. My goal with this implementation is to construct a model that accurately predicts whether an individual makes more than \$50,000. This sort of task can arise in a non-profit setting, where organizations survive on donations. Understanding an individual's income can help a non-profit better understand how large of a donation to request, or whether or not they should reach out to begin with. While it can be difficult to determine an individual's general income bracket directly from public sources, we can (as we will see) infer this value from other publicly available features.

The dataset for this project originates from the UCI Machine Learning Repository. The dataset was donated by Ron Kohavi and Barry Becker, after being published in the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". You can find the article by Ron Kohavi online. The data we investigate here consists of small changes to the original dataset, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries. This project was set up and graded by Udacity (Machine Learning Engineer Nanodegree).

Categorical features and their values:

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

*CharityML*, equipped with their research, knows individuals that make more than \$50,000 are most likely to donate to their charity. Because of this, *CharityML* is particularly interested in predicting who makes more than \$50,000 accurately. It would seem that using accuracy as a metric for evaluating a particular model's performance would be appropriate. Additionally, identifying someone that does not make more than \$50,000 as someone who does would be detrimental to *CharityML*, since they are looking to find individuals willing to donate. Therefore, a model's ability to precisely predict those that make more than \$50,000 is more important than the model's ability to recall those individuals. We can use F-beta score as a metric that considers both precision and recall:

$$ F_{\beta} = (1 + \beta^2) \cdot \frac{precision \cdot recall}{(\beta^2 \cdot precision) + recall} $$

Choosing $\beta < 1$ places more emphasis on precision. The resulting value is the $F_{\beta}$ score (or F-score for simplicity).

Looking at the distribution of classes (those who make at most \$50,000, and those who make more), it's clear most individuals do not make more than \$50,000. This can greatly affect accuracy, since we could simply say "this person does not make more than \$50,000" and generally be right, without ever looking at the data! Making such a statement would be called naive, since we have not considered any information to substantiate the claim. It is always important to consider the naive prediction for your data, to help establish a benchmark for whether a model is performing well. That being said, using that prediction would be pointless: if we predicted all people made less than \$50,000, *CharityML* would identify no one as donors.

Note: Recap of accuracy, precision, recall

Accuracy measures how often the classifier makes the correct prediction. It's the ratio of the number of correct predictions to the total number of predictions (the number of test data points). Precision is the ratio of true positives to all predicted positives (true positives plus false positives). Recall is the ratio of true positives to all actual positives (true positives plus false negatives).
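To make the naive-prediction benchmark and the F-beta score discussed above more concrete, here is a minimal sketch in plain Python. The toy labels, the function names, and the choice of beta = 0.5 are assumptions made for illustration only; they are not taken from the project's data or code. Precision and recall are reported as 0.0 when their denominators are zero, which is a convention chosen here rather than a universal rule.

```python
# Hypothetical toy labels: 1 = makes more than $50,000, 0 = does not.
# Eight of ten people are in the majority (negative) class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true, y_pred):
    """True positives divided by all predicted positives (0.0 if none)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """True positives divided by all actual positives (0.0 if none)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(prec, rec, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    denom = beta ** 2 * prec + rec
    return (1 + beta ** 2) * prec * rec / denom if denom else 0.0


# Naive prediction: "nobody makes more than $50,000".
naive_pred = [0] * len(y_true)

naive_accuracy = accuracy(y_true, naive_pred)    # 8 of 10 correct
naive_precision = precision(y_true, naive_pred)  # no predicted positives
naive_recall = recall(y_true, naive_pred)        # no donors identified
naive_f = f_beta(naive_precision, naive_recall)

print(naive_accuracy, naive_precision, naive_recall, naive_f)
```

On these toy labels the naive prediction is right most of the time, yet its recall and F-score are zero because it never identifies a single potential donor. That gap is why accuracy alone can be a misleading benchmark on imbalanced classes.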
