Welcome to FOSP Cancer Data Project’s documentation!

Authors

  • Gisele Fernandes - Epidemiology and Statistics on Cancer Group, International Research Center, A.C. Camargo Cancer Center

  • Lucas Buk Cardoso - Research Engineer at Núcleo de Sistemas Eletrônicos Embarcados, Instituto Mauá de Tecnologia

  • Maria Paula Curado - Epidemiology and Statistics on Cancer Group, International Research Center, A.C. Camargo Cancer Center

  • Stela Verzinhasse Peres - Director of Information and Epidemiology, Fundação Oncocentro de São Paulo

  • Tatiana Natasha Toporcov - Epidemiology Department, Faculdade de Saúde Pública, Universidade de São Paulo

  • Vanderlei Cunha Parro - Coordinator of Núcleo de Sistemas Eletrônicos Embarcados, Instituto Mauá de Tecnologia

Introduction

Here is presented the project documentation using all types of cancer provided by FOSP (Fundação Oncocentro de São Paulo).

The project aims to use machine learning models to predict and identify the characteristics that most interfere in the death and survival of cancer patients in the state of São Paulo, Brazil. The data were extracted from the Deputy Directorate of Information and Epidemiology of Fundação Oncocentro de São Paulo, coordinator of the Hospital Cancer Registry of São Paulo, and contains patients treated between 2000 and 2021.

The project was developed using the python language, using machine learning models, specifically the models Random Forest and XGBoost.

Complete notebooks are available on Github.

Summary

The project will be divided in topics to provide a better understanding and organization. The steps being:

  • Libraries and Functions;

  • Data analysis, creation of new columns and first preprocessing;

  • Classifiers: with preprocessing, training and validation of machine learning models. Using five labels:

    • obito_geral;

    • obito_cancer;

    • vivo_ano1;

    • vivo_ano3;

    • vivo_ano5.

For each label, some scenarios were used to analyze the performance of the models, such as grouping the years and removing some columns from the data.

Libraries and functions