
CAPSTONE PROJECT : Factors that have more incidence in the breast cancer diagnosis after a regular mammogram.


Overview

Breast cancer is the most commonly diagnosed cancer among women in the United States and the second leading cause of cancer death among them, after lung cancer.

According to the American Cancer Society's 2016 statistics, almost 1.7 million new cases of cancer are estimated to be diagnosed in 2016. Prostate cancer is the most common cancer among males (21%), followed by lung (14%) and colorectal (8%) cancers. Among females, breast (29%), lung (13%), and colorectal (8%) cancers are the most common.

Logo

In terms of cancer mortality, lung cancer is by far the leading cause of cancer death among males (27%), followed by prostate (8%) and colorectal (8%) cancers. Among females, lung (26%), breast (14%), and colorectal (8%) cancers are the leading causes of cancer death.

Logo

Problem Statement

The aim of this capstone project is to determine which factors weigh most heavily in a breast cancer diagnosis after a regular mammogram. I will also build a model that predicts the probability of breast cancer from those factors, and an interface that uses the model to return that probability once the factors are entered.

Dataset Background

The dataset used in the present project includes 2,392,998 screening mammograms (each called the "index mammogram") from women included in the Breast Cancer Surveillance Consortium (BCSC). The results were published by Barlow et al. in the September 2006 issue of the Journal of the National Cancer Institute. The women in this dataset did not have a previous diagnosis of breast cancer and did not have any breast imaging in the nine months preceding the index screening mammogram. However, all women had undergone previous breast mammography in the prior five years (though not in the last nine months). Cancer registry and pathology data were linked to the mammography data, and incident breast cancer (invasive or ductal carcinoma in situ) within one year following the index screening mammogram was assessed. There is one observation per woman, as opposed to multiple observations. All population and all other characteristics remain the same.

This dataset covers calendar years 2000 through 2009.

Data Dictionary

Variable Name - Description - Coding

year - Calendar year of observation - Numerical, 2000-2009
age_group_5_years - Age (years) in 5-year groups
  • 1 = Age 18-29
  • 2 = Age 30-34
  • 3 = Age 35-39
  • 4 = Age 40-44
  • 5 = Age 45-49
  • 6 = Age 50-54
  • 7 = Age 55-59
  • 8 = Age 60-64
  • 9 = Age 65-69
  • 10 = Age 70-74
  • 11 = Age 75-79
  • 12 = Age 80-84
  • 13 = Age ≥85
race_eth - Race/ethnicity
  • 1 = Non-Hispanic white
  • 2 = Non-Hispanic black
  • 3 = Asian/Pacific Islander
  • 4 = Native American
  • 5 = Hispanic
  • 6 = Other/mixed
  • 9 = Unknown
first_degree_hx - History of breast cancer in a first-degree relative
  • 0 = No
  • 1 = Yes
  • 9 = Unknown
age_menarche - Age (years) at menarche
  • 0 = Age ≥14
  • 1 = Age 12-13
  • 2 = Age <12
  • 9 = Unknown
age_first_birth - Age (years) at first birth
  • 0 = Age <20
  • 1 = Age 20-24
  • 2 = Age 25-29
  • 3 = Age ≥30
  • 4 = Nulliparous
  • 9 = Unknown
BIRADS_breast_density - BI-RADS breast density
  • 1 = Almost entirely fat
  • 2 = Scattered fibroglandular densities
  • 3 = Heterogeneously dense
  • 4 = Extremely dense
  • 9 = Unknown or different measurement system
current_hrt - Use of hormone replacement therapy
  • 0 = No
  • 1 = Yes
  • 9 = Unknown
menopaus - Menopausal status
  • 1 = Pre- or peri-menopausal
  • 2 = Post-menopausal
  • 3 = Surgical menopause
  • 9 = Unknown
bmi_group - Body mass index
  • 1 = 10-24.99
  • 2 = 25-29.99
  • 3 = 30-34.99
  • 4 = 35 or more
  • 9 = Unknown
biophx - Previous breast biopsy or aspiration
  • 0 = No
  • 1 = Yes
  • 9 = Unknown
breast_cancer_history - Prior breast cancer diagnosis
  • 0 = No
  • 1 = Yes
  • 9 = Unknown
count - Frequency count of this combination of covariates - Numerical
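
Because every field is stored as an integer code, it can help to decode values into labels and to treat "Unknown" as missing before plotting or modeling. The following is a minimal sketch based on the codings above. The names `RACE_ETH`, `BIRADS_DENSITY`, `UNKNOWN` and `decode` are illustrative and do not come from the original notebook.

```python
# Illustrative lookup tables copied from the data dictionary (not the author's code).
RACE_ETH = {
    1: "Non-Hispanic white",
    2: "Non-Hispanic black",
    3: "Asian/Pacific Islander",
    4: "Native American",
    5: "Hispanic",
    6: "Other/mixed",
}
BIRADS_DENSITY = {
    1: "Almost entirely fat",
    2: "Scattered fibroglandular densities",
    3: "Heterogeneously dense",
    4: "Extremely dense",
}
# 9 means "Unknown" in the coded columns, but in age_group_5_years
# code 9 is a real category (Age 65-69), so labels are looked up first.
UNKNOWN = 9


def decode(code, labels):
    """Return the label for a code, or None when the value is unknown."""
    if code in labels:
        return labels[code]
    if code == UNKNOWN:
        return None
    raise ValueError(f"unexpected code {code!r}")


print(decode(5, RACE_ETH))
print(decode(9, BIRADS_DENSITY))
```

Looking up the label before checking for 9 keeps the helper safe to reuse on age_group_5_years, where 9 is a valid age group rather than a missing value.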

Data Mining

I created a local Postgres database and loaded the dataset into a table so that I could run SQL queries over the two million records for EDA purposes and build small data frames for plotting and research.

Table risk_data - see column description in the above data dictionary
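
The kind of aggregate query used for EDA can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for the local Postgres database, the four rows are made-up values rather than BCSC data, and it assumes each row of `risk_data` is one combination of covariates, so totals are weighted by the `count` column.

```python
import sqlite3

# In-memory SQLite database standing in for the local Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE risk_data '
    '(year INTEGER, birads_breast_density INTEGER, "count" INTEGER)'
)

# Made-up rows for illustration only: (year, density code, frequency count).
rows = [(2000, 1, 10), (2000, 2, 30), (2001, 3, 40), (2001, 9, 20)]
conn.executemany("INSERT INTO risk_data VALUES (?, ?, ?)", rows)

# Sum the frequency counts per density code instead of counting rows.
query = """
    SELECT birads_breast_density, SUM("count") AS women
    FROM risk_data
    GROUP BY birads_breast_density
    ORDER BY birads_breast_density
"""
totals = dict(conn.execute(query).fetchall())

# Convert the totals into percentage shares, as a pie chart would show them.
grand_total = sum(totals.values())
shares = {density: 100.0 * n / grand_total for density, n in totals.items()}
print(shares)
```

Against the real database, the same SELECT statement would be sent through a Postgres driver instead of `sqlite3`.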

Logo

EDA / Findings

Logo

The pie chart above shows that:

  • 22.2% of the dataset has an unknown birads_breast_density.
  • 40.7% falls under birads_breast_density 1 and 2, which is considered "Benign".
  • 37% falls under birads_breast_density 3 and 4, which will require extra screening.

Average birads_breast_density per year

Breast Cancer Data Population by Age

The chart above shows that most records come from women aged 50-54, followed by women aged 45-49.

We can also infer that women aged 40-64 tend to get their mammograms more regularly.

Breast Cancer Data Population by Race

The horizontal bar chart above shows that more than half of the dataset consists of Non-Hispanic white women. The remaining 44.08% is made up of the other race/ethnicity groups, led by "Hispanic" with 13.93%.

Breast Cancer Data Population by Age menarche (Age at the first Period)

age_menarche is the patient's age when she got her first period. It seems to be one of the most influential factors.

Patient Cancer History by birads_breast_density

History of breast cancer in a first-degree relative (mother, sister or daughter) by birads_breast_density

Relationship between patient breast cancer and first-degree relative cancer history

Logo

This joint plot shows that most of the population is concentrated in BI-RADS densities 2 and 3.

Race Count by BI-RADS

This last bubble chart is a visual representation of the relationship between race and the BI-RADS classification.

Risks and Assumptions

The main risk in modeling is whether accurate relative results can be obtained when predicting from factors such as demographics, reproductive history, medications, genetic factors (e.g., family history and susceptibility genes), and clinical and biologic markers (e.g., BMI, body mass index). How these factors act jointly on risk is also important. These relative risk and attributable risk estimates, as well as missing or biased data, have to be taken into account when reviewing the results of the present study.

The statistical techniques used to calculate risk include empirical analysis through EDA, logistic regression, and other classifiers, which were compared based on their metric scores.

Model

I compared the classifiers below. It may seem like a lot, but the purpose was to select the best classifier based on mean squared error (MSE) and standard deviation (std).

Logo
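
A comparison by MSE and std could be sketched as follows. This assumes each classifier is scored several times (for example, once per cross-validation fold) and that the scores are then summarized by their mean and standard deviation. The helper names, classifier names and fold values are hypothetical, chosen only to make the example run. They are not results from this project.

```python
from statistics import mean, stdev


def mean_squared_error(y_true, y_prob):
    """Mean squared error between 0/1 labels and predicted probabilities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def summarize(fold_scores):
    """Return (mean, standard deviation) of the per-fold MSE values."""
    return mean(fold_scores), stdev(fold_scores)


# MSE of one hypothetical fold: three labels and three predicted probabilities.
print(mean_squared_error([0, 1, 1], [0.1, 0.8, 0.6]))

# Hypothetical per-fold MSE values for two placeholder classifiers.
folds = {
    "classifier_a": [0.20, 0.22, 0.21],
    "classifier_b": [0.18, 0.30, 0.24],
}
summary = {name: summarize(scores) for name, scores in folds.items()}

# Pick the lowest mean MSE; ties are broken by the lowest std.
best = min(summary, key=lambda name: summary[name])
print(best, summary[best])
```

A low std alongside a low mean suggests the classifier performs consistently across folds rather than doing well on a single lucky split.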

Each model was trained on part of the data and tested on a hold-out sample (20% of the dataset).
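
A 20% hold-out split can be sketched in plain Python as shown below. The split is stratified so that both parts keep the same share of positive cases. That stratification is an assumption of this sketch and not something the original write-up states, and `holdout_split` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def holdout_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), keeping the class balance in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 80 negatives and 20 positives.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = holdout_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting by index rather than copying rows makes it easy to apply the same split to the feature columns and the target.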

Each model used the following features from the data:

Features:

  • age_menarche - Age (years) at menarche
  • age_group_5_years - Age (years) in 5 year groups
  • race_eth - Race/ethnicity
  • first_degree_hx - History of breast cancer in a first degree relative
  • breast_cancer_history - Prior breast cancer diagnosis
  • age_first_birth - Age (years) at first birth
  • current_hrt - Use of hormone replacement therapy
  • menopaus - Menopausal status
  • bmi_group - Body mass index

Target:

Results based on BI-RADS breast density:

  • 0 - Negative - Low risk
  • 1 - Positive - High Risk
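
One way such a binary target could be derived from the density code is sketched below. The cut between codes 2 and 3 follows the grouping used in the EDA section (1 and 2 versus 3 and 4) and is an assumption. The original notebook may define the target differently, and `density_to_target` is a hypothetical name.

```python
def density_to_target(density):
    """Map a BI-RADS density code to the binary target, or None if unknown."""
    if density in (1, 2):
        return 0  # Negative - low risk
    if density in (3, 4):
        return 1  # Positive - high risk
    return None  # 9 = unknown or different measurement system


codes = [1, 2, 3, 4, 9]
targets = [density_to_target(c) for c in codes]
print(targets)
```

Rows that map to None would need to be dropped or handled separately before training, since they carry no usable label.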

Logo

Evaluation data correlation

Logo

Model ROC curve and score

Logo

Confusion and precision tables

Model Comparison

Logo

Logo

Evaluating Coefficients

Logo

Results and Conclusion

I have developed a prognostication model for early breast cancer based on the Breast Cancer Surveillance Consortium (BCSC) index mammogram dataset. The model predicts breast cancer probability after a mammogram is done. The probability is based on BI-RADS breast density, which is part of the dataset. Performance and scores were compared among several classifiers. The best performer was GridSearchCV Bagging Logistic Regression. I picked logistic regression because of its simplicity.

Being able to predict breast cancer outcomes more accurately would help physicians make informed decisions about the potential need for further investigation and treatment in female patients.
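
For illustration, a logistic regression turns the coded factors into a probability by passing a weighted sum through the logistic function. The intercept and coefficients below are hypothetical placeholders, not the fitted values from this project, and `predict_probability` is an illustrative helper. A real model would normally one-hot encode the multi-level categorical codes rather than use them as numbers.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Logistic function applied to intercept + sum(coefficient * value)."""
    z = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two binary factors (illustration only).
coefficients = {"first_degree_hx": 0.5, "current_hrt": 0.3}
intercept = -1.0

# A patient with a first-degree family history who is not on HRT.
patient = {"first_degree_hx": 1, "current_hrt": 0}
p = predict_probability(patient, coefficients, intercept)

# exp(coefficient) is the odds ratio associated with a one-unit change.
odds_ratio = math.exp(coefficients["first_degree_hx"])
print(round(p, 3), round(odds_ratio, 3))
```

This direct link between each coefficient and an odds ratio is part of what makes logistic regression simple to interpret.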

Next Steps

The model needs to be better calibrated with more data, and also validated and evaluated by professionals in the field.

It would be beneficial to explore the other BCSC datasets and additional features to improve the score.

Additional Links and Sources

BCSC DATA

BCSC DATA DEFINITION

Jupyter Notebook

Link to the Google Slide Presentation Factors that have more incidence in the breast cancer diagnosis after a regular mammogram


© 2016 Maria Pichardo. Built using Pelican. Theme by Giulio Fidente on github.