How long will it last? Machine Learning for Survival Analysis

By Batuhan Ipekci

Elevator Pitch

Survival Analysis is a versatile framework for analyzing the duration until an event occurs. Use cases can be found in a wide spectrum including medicine, economics, social sciences, and engineering. The focus will be on the estimation of real estate liquidity, leading to pricing strategies.

Description

Minutes from 1 to 5: Introduction

Survival Analysis is a versatile framework for discovering the world in unprecedented ways, and something every data science enthusiast should know. We can predict what makes one live longer, when unemployed people find a job, when customers decide to stop buying a service, when a machine needs maintenance, when children drop out of school, when a single product in stock sells, when marriages end… Today, our focus will be on how and when houses are sold, and what to do about it.

There are exciting developments in the field. However, it is still not well-known among data scientists.

Traditional methodologies in Survival Analysis rest on very restrictive assumptions that do not hold for big datasets and are not suitable for some applications. Several novel machine learning approaches have been developed to fill this void, and new Python packages are in active development.

During my talk, I will first introduce what Survival Analysis is and build some intuition on the subject. Then, different models will be compared using the example of my recent study in real estate. Lastly, I will conclude the talk with an overview of the available Python packages.

Minutes from 5 to 15: Survival Analysis

Survival Analysis originated in health statistics, but we are frequently exposed to it in daily life. Do you remember phrases like the following?

  • "A new study has found that smoking shortens lifespan by at least 10 years"
  • "Optimists live longer"
  • "Exercise lowers the risk of cardiovascular disease"

The statements above are all about how a variable affects the time until an event occurs.

In Survival Analysis, we analyze a sample of subjects and ask whether, and when, they experience an event. The event can be death or anything else; the subjects can be individual people or any other unit of observation. On an abstract level: people die, houses are sold, machines fail, borrowers pay their loans back… With the help of this framework, we can analyze any process that has some termination point.

What makes Survival Analysis special is an innovation in how we think about sampling. Whatever we are analyzing, not all of our subjects experience the event. Not everyone who has a disease dies before a study ends, so we do not know how long those patients survive afterwards. Subjects may also leave in the middle of the study and disappear for reasons we do not know.

These exceptional sample points are said to be "right-censored": we do not know what happened to them after we last observed them. For the sake of simplicity, I will restrict the talk to right-censored modeling. Let's illustrate the issue with the housing market.

Let's say we have 10 houses in total. At time 1, one house is sold: the hazard at time 1 is 1/10, so the probability of remaining unsold past time 1 is 9/10. At time 2, we are left with 9 houses at risk; 2 of them are sold and 1 is delisted, so we do not know whether it was sold or not. The hazard at time 2 is therefore 2/9, the survival probability drops to 9/10 × 7/9 = 7/10, and only 6 houses remain at risk for time 3, and so on… The aim of Survival Analysis is to derive this series of probabilities up to a horizon while having a binary target in mind. The function derived by this specific counting process, the Kaplan-Meier estimator, is called the 'baseline survival curve'.
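As a minimal sketch of this counting process, here is how the baseline survival curve can be estimated with the lifelines package (introduced at the end of the talk); the durations and event indicators below are made up for illustration:

    from lifelines import KaplanMeierFitter

    # Weeks each house spent on the market (made-up numbers).
    durations = [1, 2, 2, 2, 3, 4, 5, 5, 6, 7]
    # 1 = sold (event observed), 0 = delisted (right-censored).
    events = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=events)

    # Baseline survival curve: P(still unsold) after each week.
    print(kmf.survival_function_)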

Minutes from 15 to 20: Cox Proportional Hazards

Cox proportional hazards is the traditionally prevailing model. It compares all individuals through a linear 'risk' score and models each individual's hazard as the baseline hazard multiplied by the exponentiated risk score, h(t|x) = h0(t) · exp(βx); the baseline survival curve is estimated separately by a baseline estimator. One major drawback is that it assumes the effect (hazard ratio) of each variable on the duration to be proportional, i.e. constant, across time, as is obvious from the functional form of the model. This assumption severely limits the model.

If we return to my study in real estate: the time a house remains on the market is affected by its price. Nevertheless, it would be naive to assume that the price effect on liquidity is the same across all time points. There are periods when the market is hot, meaning that even overpriced houses may sell quickly; in a cold market, small changes in price may have large effects on liquidity. The Cox proportional hazards model, in its most basic form, cannot capture the temporal nature of these effects. Moreover, it is a linear model, so non-linearities and interactions have to be added manually through ad-hoc procedures. Proportional hazards, linearity, and the lack of interaction effects severely limit the predictions from a Cox proportional hazards model.
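As a sketch, assume a pandas DataFrame with a duration column 'weeks_on_market', an event column 'sold', and a covariate 'log_price' (all hypothetical names). Fitting a Cox model and testing the proportional hazards assumption with lifelines looks like this:

    import pandas as pd
    from lifelines import CoxPHFitter

    # Hypothetical listing data: duration, event indicator, covariate.
    df = pd.DataFrame({
        "weeks_on_market": [1, 2, 2, 3, 4, 5, 5, 6, 7, 8],
        "sold":            [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
        "log_price":       [12.1, 12.5, 13.0, 12.3, 13.2,
                            12.0, 12.8, 13.5, 12.4, 12.9],
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="weeks_on_market", event_col="sold")
    cph.print_summary()  # exp(coef) is the hazard ratio of log_price

    # Statistical check of the proportional hazards assumption;
    # a violation suggests the price effect varies over time.
    cph.check_assumptions(df, p_value_threshold=0.05)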

Minutes from 20 to 25: Deep Learning with DeepHit

We can get rid of those restrictive assumptions by choosing a proper machine learning model. Luckily, we are getting more alternatives each day: random forests, boosting models, and deep learning models such as DeepHit, Surv-Nnet, and Multi-task Neural Logistic Regression.

During my study, I trained DeepHit on my data. It performed better both in ranking the properties by whether they would sell early or late, and in predicting the number of weeks they would spend on the market. It also enabled me to see non-linear interactions: the map on the screen shows how real estate liquidity is spatially heterogeneous across Germany. Moreover, with the help of non-proportional hazard estimates, I was able to define automated strategies for a single property: whether to overprice, until when to overprice, and the cost of overpricing in terms of prolonged time on the market. This is impossible to do with classical survival analysis models like Cox proportional hazards.
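A minimal sketch of training DeepHit with the pycox package, following its documented pattern; the random arrays below stand in for the real estate features, and the network size is arbitrary:

    import numpy as np
    import torchtuples as tt
    from pycox.models import DeepHitSingle

    # Toy stand-in for listing features and labels (1000 listings, 8 covariates).
    rng = np.random.default_rng(0)
    x_train = rng.normal(size=(1000, 8)).astype("float32")
    durations = rng.integers(1, 52, size=1000).astype("float32")
    events = rng.integers(0, 2, size=1000).astype("float32")

    # DeepHit works on a discrete time grid; 26 intervals here.
    labtrans = DeepHitSingle.label_transform(26)
    y_train = labtrans.fit_transform(durations, events)

    # A small MLP whose output size matches the time grid.
    net = tt.practical.MLPVanilla(in_features=8, num_nodes=[32, 32],
                                  out_features=labtrans.out_features,
                                  batch_norm=True, dropout=0.1)

    model = DeepHitSingle(net, tt.optim.Adam, alpha=0.2, sigma=0.1,
                          duration_index=labtrans.cuts)
    model.fit(x_train, y_train, batch_size=256, epochs=20, verbose=False)

    # Survival curve per listing: rows are time points, columns are listings.
    surv = model.predict_surv_df(x_train[:5])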

I picked a single house from the test dataset and built a small dataset of its different valuation levels. This same-house-different-valuations dataset is then fed to the model. The predictions show how the survival probability of that particular house behaves when I hypothetically change its price. From the survival curve, I calculated each week's contribution to the sale of the property, i.e. the hazard function, which is by nature non-proportional and changes across time. Choosing an upper and a lower bound of the price levels (over-valuations) to compare, we can catch the weeks in which there is an opportunity to overvalue. The graph shows how many times these opportunities to overprice appeared during the lifetime of the chosen property. A cost measure for overpricing can then be calculated as the ratio of the number of opportunities appearing after the expected time of sale to all opportunities during the lifetime.

In any given week, we can check whether the property in question keeps a high enough probability of being sold even at a higher price. If so, we can choose to overprice it, knowing it will not lose much liquidity. That choice depends on the cost measure just explained: we, real estate brokers, can pick a threshold depending on our risk appetite and overprice all the properties that do not cost us too much liquidity. The whole process is automated.
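A sketch of this decision rule, assuming we already have two predicted survival curves for the same property, one at a lower and one at a higher valuation (e.g. from the model above); the curve values, tolerance, and expected week of sale are illustrative:

    import numpy as np

    # Discrete hazard: conditional probability of selling in week t,
    # given the house was still unsold at the end of week t-1.
    def hazard(surv):
        prev = np.concatenate(([1.0], surv[:-1]))
        return 1.0 - surv / prev

    def overpricing_cost(surv_low, surv_high, expected_sale_week, tol=0.0):
        # An 'opportunity' is a week in which the higher-priced listing
        # sells (almost) as fast as the lower-priced one.
        opportunities = hazard(surv_high) >= hazard(surv_low) - tol
        n_total = opportunities.sum()
        n_late = opportunities[expected_sale_week:].sum()
        # Share of opportunities arriving only after the expected sale:
        # the higher this ratio, the costlier it is to overprice.
        return n_late / n_total if n_total else 1.0

    # Illustrative 10-week survival curves for the same house at two prices.
    surv_low = np.array([0.90, 0.78, 0.70, 0.64, 0.60, 0.57, 0.55, 0.53, 0.52, 0.51])
    surv_high = np.array([0.95, 0.85, 0.72, 0.62, 0.55, 0.50, 0.47, 0.45, 0.44, 0.43])

    cost = overpricing_cost(surv_low, surv_high, expected_sale_week=4)
    # Overprice only if the cost stays below our risk-appetite threshold.
    print(f"cost of overpricing: {cost:.2f}")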

Minutes from 25 to 30: Alternative Models and Available Python Packages

Although research on Survival Analysis is flourishing, it is still not well-known among people who have not been exposed to methods in health statistics. My aim was to get you informed and excited about it by explaining the basics and showing a use case of what can be done. There are good Python packages in development, so you can experiment with them in your projects.

lifelines: An established and stable package implementing Cox Proportional Hazards and Accelerated Failure Time models.

scikit-survival: A package for Gradient Boosting and Support Vector Survival models. Note that these still assume proportional hazards, albeit in a nonlinear fashion.

pysurvival: Random Survival Forest, Conditional Survival Forest, and (Deep) Multi-Task Logistic Regression, as well as DeepSurv.

pycox: The newest one, containing only the newest deep learning models and built on PyTorch. I have used this package extensively. It covers DeepHit, (Neural) Multi-Task Logistic Regression, the Piecewise Constant Hazard (PC-Hazard), and DeepSurv.

Notes

There won't be long formulations or derivations, and coding will be kept to a minimum. The goal is to inform the audience about the basics and show the path toward further applications.

I have more than 2 years of experience in Survival Analysis, gained through a data science competition, professional work, and my Master's thesis, which is close to completion.