Titanic Dataset Solution

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. She set out on her maiden voyage from Southampton (you can read more about the whole route here), and Thomas Andrews, her architect, died in the disaster.

So you're excited to get into prediction and like the look of Kaggle's excellent getting-started competition, "Titanic: Machine Learning from Disaster"? The aim is to build a machine learning model on the Titanic dataset that predicts whether a passenger would have survived, using the passenger data. The dataset was obtained from Kaggle, and parts of this analysis started as a project for the Udacity Data Analyst Nanodegree. As far as my own story goes, I am not a professional data scientist, but I am continuously striving to become one.

First glance at our data: we can spot several features that contain missing values (NaN = not a number) that we need to deal with. It will be much more tricky to deal with the 'Age' feature, which has 177 missing values. Let's investigate and transform the features one after another.

Most passengers from third class died; maybe they didn't get a fair chance. The class plot confirms our assumption about Pclass 1, but we can also spot a high probability that a person in Pclass 3 will not survive.

Our model has an average accuracy of 82% with a standard deviation of 4%. There is also another way to evaluate a random-forest classifier, which is probably much more accurate than the score we used before: the out-of-bag estimate. Just note that the out-of-bag estimate is about as accurate as using a test set of the same size as the training set. The classification threshold also plays an important part, as we will see later; directly underneath the hyperparameter-tuning code, I put a screenshot of the grid search's output.
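The "average accuracy of 82% with a standard deviation of 4%" figure comes from cross-validation. Here is a minimal sketch of how such numbers are computed with scikit-learn; the synthetic feature matrix and all parameter values below are assumptions for illustration, not the article's actual preprocessing:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed Titanic features
# (assumption: any binary classification data behaves the same here).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=1)

# 10-fold cross-validation returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")

print("Mean accuracy:", round(scores.mean(), 2))
print("Standard deviation:", round(scores.std(), 2))
```

The mean summarizes the model's quality; the standard deviation shows how precise that estimate is across folds.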
Sep 8, 2016.

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 out of 2,224 passengers and crew. In this notebook I will do a basic exploratory data analysis of the Titanic dataset and attempt to answer a few questions about the tragedy, before building predictive models. The workflow has several stages. Data extraction: we'll load the dataset and have a first look at it. Cleaning: we'll fill in missing values. Assumptions: we'll formulate hypotheses from the charts. Note that because the competition does not provide labels for its test set, we need to use predictions on the training set to compare the algorithms with each other.

Since the Ticket attribute has 681 unique tickets, it will be tricky to convert them into useful categories, so we will drop the feature from the dataset. Embarked: we convert the 'Embarked' feature into a numeric one.

A confusion matrix summarizes the model's mistakes. The first row is about the not-survived predictions: 493 passengers were correctly classified as not survived (called true negatives) and 56 were wrongly classified as survived (false positives). In the age plot, the blue line of survivors crosses the red line of non-survivors for children, who were saved first. The classifier will only get a high F-score if both recall and precision are high. On the ROC plot, the red line in the middle represents a purely random classifier (e.g. a coin flip), and your classifier should therefore be as far away from it as possible.

The random-forest algorithm brings extra randomness into the model when it is growing the trees. Later we will discuss how random forest works, take a look at the importance it assigns to the different features, and tune its performance by optimizing its hyperparameter values; below you can see the code of the hyperparameter tuning for the parameters criterion, min_samples_leaf, min_samples_split and n_estimators. So far my submission has a 0.78 score, using soft majority voting with logistic regression and random forest.
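To make the confusion-matrix terminology concrete, here is a toy example with made-up labels and predictions (the numbers are assumptions for illustration, not the article's 493/56 results), using scikit-learn's confusion_matrix:

```python
from sklearn.metrics import confusion_matrix

# Assumed toy labels (1 = survived) and a model's predictions.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # → 3 1 1 3
```

Precision is tp / (tp + fp) and recall is tp / (tp + fn), which is exactly what the later precision/recall discussion builds on.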
It took some nerve to start on Kaggle, but I am really glad I did. We begin by importing the libraries and loading the data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

filename = 'titanic_data.csv'
titanic_df = pd.read_csv(filename)
```

The dataset provides information on the fate of passengers on the Titanic, summarized according to economic status (class), sex, age and survival, and machine learning techniques will be applied to predict which passengers survived and which did not; among the models will be a logistic classifier trained on features such as age, sex and ticket class. Above you can see the 11 features plus the target variable (Survived). 'Fare' is a float, and we have to deal with four categorical features: Name, Sex, Ticket and Embarked. A cabin number looks like 'C123', where the letter refers to the deck. During this process we use seaborn and matplotlib for the visualizations.

Another great quality of random forests is that they make it very easy to measure the relative importance of each feature; below you can see how a random forest would look with two trees. For evaluation, the F-score is not that high, because we have a recall of 73%. One way to choose an operating point is to plot precision and recall against each other; another way to evaluate and compare binary classifiers is the ROC curve, whose summary score is simply the area under the curve (AUC). Plotting precision and recall against the threshold with matplotlib, you can clearly see that the recall falls off rapidly at a precision of around 85%. Because of that, you may want to select the precision/recall tradeoff before that point, maybe at around 75%. You are then able to choose a threshold that gives you the best precision/recall tradeoff for your machine learning problem.
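The precision-versus-recall-versus-threshold idea can be sketched with scikit-learn's precision_recall_curve; the scores below are made-up stand-ins for the model's predicted probabilities, not the article's actual outputs:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Assumed toy labels and scores; in the article these would come from
# something like random_forest.predict_proba(X_train)[:, 1].
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Each threshold trades precision against recall; a higher threshold
# raises precision and lowers recall.
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Plotting `precision` and `recall` against `thresholds` reproduces the kind of curve described above, and lets you pick the threshold for your desired tradeoff.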
Another thing that could improve the overall result on the Kaggle leaderboard would be a more extensive hyperparameter tuning on several machine learning models; you could also do some ensemble learning. But you can't build great monuments until you place a strong foundation, so let's walk through the modeling code first. (We also tweak the style of this notebook a little bit to have centered plots.) Note that a lot of features first had to be converted into numeric ones so that the machine learning algorithms can process them; from the cabin numbers, for example, we extract a new feature that contains a person's deck, using regular expressions and binning.

We plot survival against the number of relatives, drop the features we no longer need, train several models, inspect feature importances, and set up the hyperparameter grid:

```python
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

axes = sns.factorplot('relatives', 'Survived', data=train_df, aspect=2.5)

train_df = train_df.drop(['PassengerId'], axis=1)
train_df = train_df.drop(['Name'], axis=1)
train_df = train_df.drop(['Ticket'], axis=1)

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
# X_test is built the same way from the (unlabeled) test set.

sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
sgd.fit(X_train, Y_train)

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

importances = pd.DataFrame({
    'feature': X_train.columns,
    'importance': np.round(random_forest.feature_importances_, 3),
})

train_df = train_df.drop("not_alone", axis=1)

print("oob score:", round(random_forest.oob_score_, 4) * 100, "%")

param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5, 10, 25, 50, 70],
    "min_samples_split": [2, 4, 10, 12, 16, 18, 25, 35],
    "n_estimators": [100, 400, 700, 1000, 1500],
}

rf = RandomForestClassifier(n_estimators=100, max_features='auto',
                            oob_score=True, random_state=1, n_jobs=-1)
clf = GridSearchCV(estimator=rf, param_grid=param_grid, n_jobs=-1)
clf.fit(X_train, Y_train)
```

Kaggle has a very exciting competition for machine learning enthusiasts here, and people are keen to pursue a career as a data scientist. Details can be obtained on the 1,309 passengers and crew on board the ship, and a quick overview of the features comes from:

```python
# get info on features
titanic.info()
```

Some historical context: the RMS Titanic was the largest ship afloat at the time she entered service, and was the second of three Olympic-class ocean liners operated by the White Star Line. A few more preprocessing notes. Fare: we convert 'Fare' from float to int64, using the astype() function pandas provides. Name: we use the Name feature to extract the titles, so that we can build a new feature out of them. When you are growing a tree in a random forest, only a random subset of the features is considered for splitting each node. Evaluating the tuned model with 10-fold cross-validation therefore outputs an array with 10 different scores. Now, let's have a look at our current clean Titanic dataset.
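The title extraction mentioned above can be sketched with a small regular expression. The sample names follow the dataset's "Surname, Title. Given names" format; the exact pattern below is an assumption, one of several that work:

```python
import pandas as pd

# Assumed sample names in the dataset's format.
df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]})

# Extract the first word that is followed by a period: the title.
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
print(df["Title"].tolist())  # → ['Mr', 'Mrs', 'Miss']
```

Rare titles (Dr, Rev, Countess, ...) would then typically be grouped into a single "Rare" category before the feature is converted to numbers.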
Getting started with the Kaggle Titanic problem using logistic regression, posted on August 27, 2018.

The dataset was obtained from Kaggle (https://www.kaggle.com/c/titanic/data). It describes passenger attributes such as Age, Sex, Ticket, Fare and so on, and most variables are categorical; PassengerId simply starts at 1 for the first row and increments by 1 for every new row. Below are the first few records from the dataset. In this challenge, we are asked to predict whether a passenger on the Titanic would have survived. The Titanic data set is said to be the starter for every aspiring data scientist, and the solution provided below also contains exploratory data analysis (EDA) of the dataset, with figures and diagrams. From the summary statistics we can also see that the passenger ages range from 0.4 to 80, and we will try to draw a few insights from the data using univariate and bivariate analysis.

Random forest is a supervised learning algorithm; I will not go into detail here about how it works. For each person the algorithm has to classify, it computes a probability based on a function, and it classifies the person as survived when the score is bigger than the threshold, or as not survived when the score is smaller. Of course we have a tradeoff here, because the classifier produces more false positives the higher the true positive rate is. The ROC AUC score is the score corresponding to the ROC curve, computed by measuring the area under it. The code below performs K-fold cross-validation on our random forest model, using 10 folds (K = 10).
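A minimal sketch of computing the ROC AUC score with scikit-learn, on assumed toy labels and scores rather than the article's model outputs:

```python
from sklearn.metrics import roc_auc_score

# Assumed toy labels (1 = survived) and predicted scores.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# 1.0 = perfect ranking of positives above negatives, 0.5 = coin flip.
auc = roc_auc_score(y_true, y_scores)
print(round(auc, 2))  # → 0.75
```

Here 3 of the 4 positive/negative score pairs are ranked correctly, hence 0.75.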
Cross-validation works as follows: with K = 4, we would split the training data into 4 folds; our random forest model would be trained and evaluated 4 times, using a different fold for evaluation every time while being trained on the remaining 3 folds. K-fold cross-validation repeats this process until every fold has acted once as the evaluation fold. Training each tree of the forest on a random subset of the data creates a wide diversity, which generally results in a better model. (Though we could use the merged dataset for EDA, I will use the train dataset for the analysis. I will drop PassengerId from the train set, but not from the test set, since it is required there for the submission.)

The 'Cabin' feature needs further investigation: it looks like we might want to drop it from the dataset, since 77% of it is missing. At first I thought we would have to delete the 'Cabin' variable, but then I found something interesting in the deck letters. Converting 'Fare' into categories isn't that easy either, because if we cut the range of fare values into a few equally big categories, 80% of the values would fall into the first category.

Looking at survival by age and sex: for men, the probability of survival is very low between the ages of 5 and 18, but that isn't true for women, and infants also have a slightly higher probability of survival. Here we can also see that you had a high probability of survival with 1 to 3 relatives, but a lower one if you had fewer than 1 or more than 3 (except for some cases with 6 relatives).
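The relatives idea above (SibSp plus Parch, with a flag for travelling alone) can be sketched like this; the column values are assumed sample rows, not real passengers:

```python
import pandas as pd

# Assumed sample rows matching the Titanic schema.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Total relatives aboard, and a flag for passengers travelling alone.
df["relatives"] = df["SibSp"] + df["Parch"]
df["not_alone"] = (df["relatives"] == 0).astype(int)
print(df)
```

This is the kind of derived feature whose usefulness the random forest's feature importances can later confirm or reject.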
There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history. The main statistical use of this data set is Chi-squared tests and logistic regression, with survival as the key dependent variable. In the merged dataset, Survived has 31.9% NA values (these only represent the unlabeled test set), Age has 20.1% and Fare has 0.07%.

A quick data dictionary: Embarked is the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton), SibSp is the number of siblings/spouses aboard the Titanic, and Parch is the number of parents/children aboard. Typical names look like "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" and "Futrelle, Mrs. Jacques Heath (Lily May Peel)".

In the picture below you can see the actual decks of the Titanic, ranging from A to G; after extracting the deck letter we will convert the feature into a numeric variable, and we will generate another plot of it below. Now we can also tackle the issue of the Age feature's missing values. In the fare plot, the blue line crosses the red line as Fare increases, which might be linked to passenger class. We can explore many more relationships among the given variables and derive new features; we could also remove more or fewer features, but that would need a more detailed investigation of each feature's effect on the model. You can combine precision and recall into one score, which is called the F-score, and you can even make trees more random by using random thresholds for each feature rather than searching for the best possible thresholds (like a normal decision tree does). Overall, our random forest model seems to do a good job. We will talk about this in the following section.
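Extracting the deck letter from the cabin number and turning it into a numeric variable, as described above, can be sketched as follows; the mapping values and the 'U0' placeholder for missing cabins are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Assumed sample cabins; NaN stands in for the many missing entries.
df = pd.DataFrame({"Cabin": ["C123", np.nan, "G6"]})

# Map each deck letter A..G to a number; 'U' (unknown) falls through to 0.
deck_map = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}
df["Deck"] = (df["Cabin"].fillna("U0")   # placeholder for missing cabins
                .str[0]                  # first character is the deck
                .map(deck_map)
                .fillna(0)
                .astype(int))
print(df["Deck"].tolist())  # → [3, 0, 7]
```

After this, the original Cabin column can be dropped, since its information now lives in the numeric Deck feature.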
Luckily, having Python as my primary weapon gives me an advantage in the field of data science and machine learning, as the language has vast library support. This is a classic dataset, often used for data mining tutorials and demonstrations: Kaggle gives you the Titanic CSV data, and your model is supposed to predict who survived. The training set contains demographics and passenger information for 891 of the 2,224 passengers and crew on board; for queries related to passenger survival I will use the train set, since the test set is unlabeled. The purpose here is to perform a data analysis on this sample, and the data can only be used effectively if we can extract useful information from its features.

Most passengers embarked at Southampton, followed by Cherbourg (18.9%) and Queenstown (8.6%). As we know that women and children were saved first, a large proportion of the survivors are women; a related pattern exists in the titles, as male titles like 'Mr' have lower survival rates. The standard deviation of the cross-validation scores shows us how precise the estimates are.

We get an overview of the features, their missing values (which will partly be converted to zero), and some first plots like this:

```python
# get info on features
titanic_df.info()

total = train_df.isnull().sum().sort_values(ascending=False)

FacetGrid = sns.FacetGrid(train_df, row='Embarked', size=4.5, aspect=1.6)
sns.barplot(x='Pclass', y='Survived', data=train_df)
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', size=2.2, aspect=1.6)
```
The Python programming language is used throughout; this is, in effect, a Python tutorial for Kaggle's "Titanic: Machine Learning from Disaster" competition. To recap the feature engineering: we converted the 'Fare' feature into numeric categories, categorized every age into a group, and added two new features, 'not_alone' and 'relatives', derived from SibSp and Parch; more features could be engineered from Cabin or the tickets. Evaluating a classifier properly is more complicated than it first appears, and if a model is allowed to grow too complex it will suffer from overfitting (and vice versa), which is why we tune hyperparameters with cross-validation rather than a single training score.
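Because equal-width fare bins would dump most passengers into the first category, quantile-based binning is the natural alternative. A sketch with pandas' qcut on assumed sample fares (the fare values and number of bins are illustrative, not the article's exact choices):

```python
import pandas as pd

# Assumed sample fares; qcut builds quantile-based categories, so each
# bin receives roughly the same number of passengers, unlike equal-width
# bins where most fares would land in the first bin.
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 35.5, 71.3, 263.0])
categories = pd.qcut(fares, 4, labels=[0, 1, 2, 3]).astype(int)
print(categories.tolist())  # → [0, 0, 1, 1, 2, 2, 3, 3]
```

The resulting integer category can then replace the raw Fare column as a model input.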
Many of these insights came from univariate and bivariate analysis and visualization techniques. Class 3 passengers more often travelled with three or more siblings or in large families compared to classes 1 and 2, which hurt their survival chances, while survival chances are generally higher for passengers between 14 and 40. The Titanic data is a manageably small but very interesting dataset with easily understood variables, which makes it a good place to learn how to further boost the score on a classification problem.
We then start evaluating the model's performance. The out-of-bag estimate removes the need for a separate test set, and we create the new 'AgeGroup' variable by categorizing every age into a group, alongside the feature that shows whether someone is not alone. The F-score is not a perfect summary, because you sometimes want a high precision and sometimes a high recall, depending on the problem; 'Parch' also turns out not to play a significant role in the random forest's feature importances, so it does not contribute much to a person's predicted survival probability. One thing worth noting is that a combination of learning models (an ensemble) often improves the overall result.
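Categorizing every age into a group, as described for the 'AgeGroup' variable, can be sketched with pandas' cut; the bin edges below are hypothetical, not necessarily the article's:

```python
import pandas as pd

# Assumed sample ages and hypothetical bin edges.
ages = pd.Series([2, 16, 25, 40, 70])

# labels=False returns the bin index for each age.
age_group = pd.cut(ages, bins=[0, 11, 18, 22, 27, 33, 40, 66, 100],
                   labels=False)
print(age_group.tolist())  # → [0, 1, 3, 5, 7]
```

The imputed ages from the earlier missing-value step would be binned the same way, so the model only ever sees the group index.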
As noted before, 'PassengerId' is dropped from the train set but kept in the test set for the submission. This dataset is said to be the starter for every aspiring data scientist, and working through it is a good way to test your theoretical knowledge on a real-world data problem. Experts say that if you struggle with data science you should master statistics in great depth, since statistics lies at the heart of data science; the charts that (hopefully) spot correlations and hidden insights in the data are where that knowledge pays off. Before modeling, we follow the usual preprocessing steps and standardize the variables.
To add the dataset to a project on the platform, choose the upload icon on the right and select the file train.csv, then check that the dataset has been well preprocessed. In conclusion: we built a model that predicts survival reasonably well, although the recall of 73% shows there is still room for improvement, for example through more feature engineering, more extensive hyperparameter tuning, or ensemble learning. All of this comes with the pride of holding the "sexiest job" of our time.

