For CA 2 in my Data Management and Analytics module, I was asked to use the R language. R is a free software environment for statistical computing and graphics. It compiles and runs on Windows and macOS, among other platforms. The course, powered by Code School, helped us learn how the R language works and also gave us some examples and uses of R.
If you would like to learn more about R, I would recommend the Code Academy website. For an initial understanding, I would recommend their introductory courses on basic expressions, data frames, vectors, matrices and so on.
To confirm that I have completed all the steps of the R tutorial, here is a picture of the end of the course.
I used a dataset I found on the internet. I love animals, so I chose data about animals. I took the datasets from Kaggle, then installed the required library packages and loaded the data into RStudio.
The data comes from the Austin Animal Center and covers October 1st, 2013 to March 2016. Outcomes represent the status of animals as they leave the Animal Center. All animals receive a unique Animal ID during intake.
In this competition, you are going to predict the outcome of each animal as it leaves the Animal Center. These outcomes include: Adoption, Died, Euthanasia, Return to owner, and Transfer.
The train and test data are randomly split.
I downloaded my sample data, opened it in Excel and saved it in my folder in CSV format (CSV stands for comma-separated values). From the CSV file I imported the data into R and loaded it into a data frame.
The image below shows the package I loaded.
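As a minimal sketch of this import step, the snippet below writes a tiny CSV file to a temporary location and reads it back with `read.csv`. The file contents, the IDs and the three column names are assumptions made for illustration only; the real Kaggle file is larger and lives wherever you saved it.

```r
# Hypothetical sample rows; the real file has many more rows and columns
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "AnimalID,AnimalType,OutcomeType",
  "A000001,Dog,Adoption",
  "A000002,Cat,Transfer",
  "A000003,Dog,Return_to_owner"
), csv_path)

# read.csv parses the comma-separated file into a data frame
animals <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the column names and types of the loaded data frame
str(animals)
```

In practice you would replace `csv_path` with the path to your own saved CSV file.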
I wanted to organise the data and see how the outcomes are distributed for the 4800 cats and 6656 dogs in the training set. Both cats and dogs are commonly adopted or transferred, but dogs are much more likely than cats to be returned to their owners. It also appears that cats are more likely than dogs to have died. Fortunately, the overall figures below show that very few animals die or are euthanized.
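A comparison like this can be sketched with a cross-tabulation. The example below is only an illustration: the seven rows are made up, and the column names `AnimalType` and `OutcomeType` are assumed rather than taken from my actual script.

```r
# Hypothetical mini dataset standing in for the training data
outcomes <- data.frame(
  AnimalType = c("Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Dog"),
  OutcomeType = c("Adoption", "Transfer", "Transfer",
                  "Adoption", "Return_to_owner", "Adoption", "Transfer")
)

# Count how often each outcome occurs for each animal type
counts <- table(outcomes$AnimalType, outcomes$OutcomeType)

# Convert counts to shares within each animal type (each row sums to 1)
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))
```

Using row-wise shares rather than raw counts makes cats and dogs comparable even though there are more dogs than cats.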
Unsurprisingly, baby animals are more likely to be adopted than adult animals. They are also more likely to be transferred and to have died.
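Splitting animals into babies and adults needs an age measured in a single unit. The sketch below shows one possible way to do that, assuming ages are stored as text such as "2 years" or "3 weeks". The helper `age_to_days`, the 30-day month, and the 365-day cutoff for "baby" are all assumptions for illustration, not necessarily what I used.

```r
# Hypothetical helper: convert strings like "2 years" into a number of days
age_to_days <- function(age) {
  parts <- strsplit(age, " ")
  value <- as.numeric(sapply(parts, `[`, 1))
  # Drop a trailing "s" so "years" and "year" map to the same unit
  unit <- gsub("s$", "", sapply(parts, `[`, 2))
  # Approximate conversion factors (a month is treated as 30 days)
  multiplier <- c(day = 1, week = 7, month = 30, year = 365)[unit]
  unname(value * multiplier)
}

ages <- c("2 years", "3 weeks", "1 month", "5 days")
days <- age_to_days(ages)

# Assumed cutoff: animals younger than one year count as babies
lifestage <- ifelse(days < 365, "baby", "adult")

print(data.frame(ages, days, lifestage))
```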
I am finally ready to factorize the rest of the variables in preparation for fitting a model to the data and making a prediction. Let's fit a randomForest model predicting OutcomeType.
The commands I used are below:
# Split up train and test data
train <- full[1:26729, ]
test <- full[26730:nrow(full), ]
# Set a random seed
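The steps after the seed comment are not shown above. As a rough sketch of how seeding, fitting and predicting with the `randomForest` package generally fit together, the example below uses the built-in `iris` data as a stand-in: `Species` plays the role of OutcomeType, and the seed value, `ntree` setting and `_demo` variable names are arbitrary assumptions, not the ones from my analysis.

```r
# Hypothetical sketch: iris stands in for the shelter data,
# and Species stands in for OutcomeType
library(randomForest)

# Set a random seed so the result is reproducible (the value is arbitrary)
set.seed(42)

# Shuffle the rows, then split by row position as in the listing above
full_demo <- iris[sample(nrow(iris)), ]
train_demo <- full_demo[1:100, ]
test_demo <- full_demo[101:nrow(full_demo), ]

# Fit a random forest predicting the factor outcome from all other columns
rf_demo <- randomForest(Species ~ ., data = train_demo,
                        ntree = 200, importance = TRUE)

# Variable importance: a higher MeanDecreaseGini means a more useful predictor
print(importance(rf_demo))

# Share of trees voting for each class, for every row of the test set
votes_demo <- predict(rf_demo, newdata = test_demo, type = "vote")
print(head(votes_demo))
```

The importance table is where a statement such as "this variable matters most" comes from, and the vote shares can serve as per-class probabilities for a multiclass prediction.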
To conclude, the most important variable so far for predicting the outcomes of shelter animals is Ageing Days (the animal's age in days). As this was my first time doing multiclass classification, I made lots of mistakes and learned from them. If I had more time, I would probably look into why Ageing Days, and not whether the animal is intact, is the most important variable so far.