THE DETECTION OF A MALIGNANT TUMOR IN THE BREAST
There are many factors that can contribute to this disease. Genetics, family history, and race are quite important though uncontrollable factors. Being overweight, lack of exercise, smoking, and eating unhealthy food can also contribute to breast cancer.
There are mainly two types of breast tumor
1). Benign
2). Malignant
A tumor can be benign (not dangerous to health) or malignant (has the potential to be dangerous). Benign tumors are not considered cancerous: their cells are close to normal in appearance, they grow slowly, and they do not invade nearby tissues or spread to other parts of the body.
Malignant tumors are cancerous. Left unchecked, malignant cells eventually can spread beyond the original tumor to other parts of the body.
The term “breast cancer” refers to a malignant tumor that has developed from cells in the breast.
Stages of breast cancer
Breast cancer stage is usually expressed as a number on a scale of 0 through IV — with stage 0 describing non-invasive cancers that remain within their original location and stage IV describing invasive cancers that have spread outside the breast to other parts of the body.
The main aim of this project is to detect the malignant tumor in early stages so that the patient can get enough time for treatment and a healthy recovery.
For this, the dataset that is used has 30 different features regarding the texture, dimensionality, and shape of the tumor. Various models are used to detect whether it is Benign or Malignant.
The 30 features of the dataset are described below:
radius_mean
texture_mean
perimeter_mean
area_mean
smoothness_mean
compactness_mean
concavity_mean
concave points_mean
symmetry_mean
fractal_dimension_mean
radius_se
texture_se
perimeter_se
area_se
smoothness_se
compactness_se
concavity_se
concave points_se
symmetry_se
fractal_dimension_se
radius_worst
texture_worst
perimeter_worst
area_worst
smoothness_worst
compactness_worst
concavity_worst
concave points_worst
symmetry_worst
fractal_dimension_worst
Prediction models: These models are used for the classification of the tumor in breast
K-nearest neighbors
Logistic Regression
Decision Tree
Random Forest
Gradient Boosting
SVM
MLP Classification
K-NEAREST NEIGHBORS
K-Nearest Neighbors algorithm is used to classify the type of breast cancer. Pre-processed data is split into training and testing dataset. The prediction accuracy achieved on training dataset is 96%, whereas on testing dataset, the achieved accuracy is 92%.
The Graph below shows the relation between training and testing accuracy.
This is your Project description. Whether your work is based on text, images, videos or a different medium, providing a brief summary will help visitors understand the context and background. Then use the media section to showcase your project!
LOGISTIC REGRESSION
For Logistic Regression the results achieved are describes below.
Training set accuracy: 95.5%
Testing set accuracy: 93.7%
The graph of coefficient magnitude and features is plotted.
This is your Project description. Whether your work is based on text, images, videos or a different medium, providing a brief summary will help visitors understand the context and background. Then use the media section to showcase your project!
DECISION TREE
Accuracy on training set: 0.986
Accuracy on test set: 0.937
Feature importance graph for Decision Tree is plotted as below.
This is your Project description. Whether your work is based on text, images, videos or a different medium, providing a brief summary will help visitors understand the context and background. Then use the media section to showcase your project!
RANDOM FOREST
The accuracy achieved with RF and the feature importance graph are showed below.
Accuracy on training set: 1.000
Accuracy on test set: 0.958
This is your Project description. Whether your work is based on text, images, videos or a different medium, providing a brief summary will help visitors understand the context and background. Then use the media section to showcase your project!
GRADIENT BOOSTING
The prediction accuracy achieved on training dataset is 100%, whereas on testing dataset, the achieved accuracy is 95.8%.
This is your Project description. Whether your work is based on text, images, videos or a different medium, providing a brief summary will help visitors understand the context and background. Then use the media section to showcase your project!
MLP CLASSIFICATION
MLP classification
Accuracy on training set: 0.958
Accuracy on test set: 0.979
The plot shows the weights that were learned connecting the input to the first hidden layer. The rows in this plot correspond to the 30 input features, while the columns correspond to the 100 hidden units. Light colors represent large positive values, while dark colors represent negative values.
One possible inference we can make is that features that have very small weights for all of the hidden units are “less important” to the model. We can see that “mean smoothness” and “mean compactness,” in addition to the features found between “smoothness error” and “fractal dimension error,” have relatively low weights compared to other features. This could mean that these are less important features or possibly that we didn’t represent them in a way that the neural network could use.