Tuesday, 16 December 2014

Even More Statistical Analysis on Bondora P2P Loans

In a number of previous posts on the BTLFinder blog I wrote about some of the analysis I did of the Bondora loan spreadsheet using basic analysis using SQL Server and Excel.

See here, here and here for the analysis.

In this article I’ll show some of the results I obtained from an analysis of the Bondora loan data using SAS Studio. SAS is a popular software application used by organisations that need to perform statistical analysis on their data sets.

Now I’ll point out that I’m NOT a statistician and although I did study one module of statistics at University, that was over 20 years ago! So excuse me if I make a complete mess of this (but if I have and you spot any glaring errors, feel free to leave comments). I have a suspicion my analysis is flawed, but when I get the hang of things I'll publish some more results. </disclaimer>

Data Preparation for Analysis

Now I’ve quickly learnt that SAS is a lot less forgiving of bad data quality than are either Excel or SQL Server. And unfortunately (according to SAS anyway) there are quite a lot of data quality issues with the Bondora dataset.

So here are the steps I took to clean up the Bondora dataset and get it into SAS:

  1. I loaded the data into Excel.
  2. In Excel I removed all loans where WasFunded = 0 and CreditDecision = 0.
  3. Then I removed all of the loans with a LoanDate within the last 3 months, just to avoid new loans from skewing the results too much.
  4. After that I removed any loan without a value for Gender, and those loans with a credit_score of 0.
  5. I then imported the data into SAS by way of a CSV file exported from Excel.

My first attempt at loading the data into SAS was not successful due to numerous import errors. I tried playing around by removing the lines it was complaining about. But in the end I succeeded by just importing selected columns of data into SAS rather than the entire file. These are just some of the columns it was happy with: IsBusinessLoan, Age, Gender, Country, credit_score, CreditGroup, LoanDuration, UseOfLoan, ApplicationType, education_id, marital_status_id, employment_status_id, Employment_Duration_Current_Employer, work_experience, occupation_area, home_ownership_type_id, AD, DefaultedOnDay, AppliedAmount and FundedAmount.

As with my previous analysis I've filtered the data so that it mainly includes information about Estonian borrowers. I know a lot of Bondora lenders tend to invest solely in these - the other countries being a little riskier. See below for why non-Estonian loans are big trouble! Thankfully SAS is able to filter data using WHERE clauses, so I could just import the whole dataset, then filter the reports.

Age of Borrower

The chart below shows the age distribution of Bondora borrowers:



I did this analysis in Excel before but what I like about SAS is that it automatically groups ranges and puts in nice distribution curves.

Anyway, this chart is nice but it's not particularly useful other than telling us that peak demand for loans is from borrowers in their late 20's and early 30's.

What's more useful is default rates plotted against age of borrower. In the chart below you can see loans by age and grouped by default status (AD=1 is for defaulted loans, AD=0 is for non-defaulted loans):



The basic message here is as with my previous analysis, I found that younger borrowers are more likely to default on their loans.

Number of Days to Default

I wanted to model loan defaults to see how long it took for the average borrower to default on their loans. I read on various Bondora forums and blogs that most borrowers default at the start of their loan.

Indeed this is true:



As you can see from the chart above, nearly 40% of borrowers who default on their loans do so within the first 160 days (i.e. 6 months).

This surprised me, as my experiences on Zopa was that my borrowers mostly defaulted late into their loan repayments.

The data in this chart have important implications for Bondora and maybe other P2P platforms that have a secondary market. It implies that it could be very worthwhile buying loans on a secondary market where you can screen out the loans by the number of repayments. I’d be interested in knowing if this is the same on platforms like Funding Circle.

And anyone buying new loans and hoping to sell them on at a profit – you could get your fingers burnt!

By the way, I mentioned at the top that a lot of lenders have got their fingers burnt on non-Estonian loans. Are the other countries really that bad? Here are default rates for the other countries:




So as you can see from these charts, for each of the countries Spain, Finland and Slovakia, if loans in these countries go bad, then they usually go bad in the first 100 days. By comparison, Estonian loan defaults peak at around the 160 day mark.

OUCH, OUCH, OUCH!

So yes, lenders are justified in being very wary of Spanish, Finnish and Slovakian loans.

One other word of caution - Bondora hasn't been so active in these other markets, so there is less historical data available compared to that available for Estonia.

Loan Defaults by Loan Amount

Whenever I've been investing in P2P platforms I've always been wary of the larger loans. For example, I used to like investing in the £3000 jewellery loans on Funding Secure and was wary of the £250,000+ loans.

But are larger loans any more risky?

According to this chart I generated for Bondora, apparently not:



This chart shows that defaults are much more likely with lower value loans. The bottom graph shows the data for loans that have defaulted (AD=1). The orange line is the one of interest, showing a spike of defaults for smaller loan amounts.

Well it's dangerous to extrapolate one chart to the entire P2P industry, but I'll definitely be happy to invest in some larger loans if they come my way.

As for the explanation, well I guess it's simply down to borrowers who borrow larger amounts of money find it easier to repay their loans. Somebody who borrows €10,000 is far more likely to be able to repay their loan than somebody who needs to borrow €500.

Summary


At the time of writing this blog post, Bondora have just announced that they've overhauled the loan credit scoring system. So I think it's back to the drawing board as far as my analysis goes.

I have a sneaking suspicion my SAS skills need a lot of work. However, it's pretty obvious from my basic analysis that non-Estonian loans are a major credit risk - especially if Bondora don't get their act together and chase defaulters as successfully as they've managed in Estonia.

If you want to repeat my analysis then I've left the SAS code below.

Have fun investing!

/*SAS Studio Bondora Analysis Code*/

/*Data import*/


proc import datafile="/folders/myfolders/LoanData.csv" out=mydata dbms=csv replace;
    getnames=yes;
run;

proc contents data=mydata;
run;

/*Number of days to default*/

ods graphics / reset width=6.4in height=4.8in imagemap;
title "Number of Days to Default";

/*--SGPLOT proc statement--*/
proc sgplot data=WORK.MYDATA;
    /*--Histogram settings--*/
    histogram DefaultedOnDay / scale=PERCENT;

    /*--Kernel Density plot settings--*/
    density DefaultedOnDay / type=Kernel;

    /*--Horizontal Axis--*/
    xaxis label="Day of Default";

    /*--Vertical or Response Axis--*/
    yaxis grid;
    discretelegend "DENSITY" / location=inside position=topright across=1;
   
    where Country = 'EE';
run;

/* Age Distribution*/

ods noproctitle;
ods select where=(lowcase(_path_) ? 'plot' or lowcase(_path_) ? 'gram');

proc univariate data=WORK.MYDATA noprint;
where Country = 'EE';
    histogram Age / kernel;
run;

/*Age by AD*/

ods noproctitle;
ods select where=(lowcase(_path_) ? 'plot' or lowcase(_path_) ? 'gram');

proc univariate data=WORK.MYDATA noprint;
    class AD;
    where Country = 'EE';
    histogram Age / kernel;
run;

/* Applied Amount*/

Title;
ods graphics on;
ods noproctitle;
**************************;
*** Test for normality ***;
**************************;
ods select TestsForNormality;

proc univariate data=WORK.MYDATA normal mu0=0;
    class AD;
    var AppliedAmount;
run;

ods select all;
**************************;
*** t Test ***;
**************************;

proc ttest data=WORK.MYDATA sides=2 h0=0 plots(only showh0)=(summaryPlot
        intervalPlot qqplot);
    class AD;
    var AppliedAmount;
run;


/*Number of Days to Default - by County*/

ods graphics / reset width=6.4in height=4.8in imagemap;
title "Number of Days to Default";

/*--SGPLOT proc statement--*/
proc sgplot data=WORK.MYDATA;
    /*--Histogram settings--*/
    histogram DefaultedOnDay / scale=PERCENT;

    /*--Kernel Density plot settings--*/
    density DefaultedOnDay / type=Kernel;

    /*--Horizontal Axis--*/
    xaxis label="Day of Default";
    title "Number of Days to Default (Estonia)";

    /*--Vertical or Response Axis--*/
    yaxis grid;
    discretelegend "DENSITY" / location=inside position=topright across=1;
   
    where Country = 'EE';
   
proc sgplot data=WORK.MYDATA;
    /*--Histogram settings--*/
    histogram DefaultedOnDay / scale=PERCENT;

    /*--Kernel Density plot settings--*/
    density DefaultedOnDay / type=Kernel;

    /*--Horizontal Axis--*/
    xaxis label="Day of Default (Spain)";
    title "Number of Days to Default (Spain)";   

    /*--Vertical or Response Axis--*/
    yaxis grid;
    discretelegend "DENSITY" / location=inside position=topright across=1;
   
    where Country = 'ES';

proc sgplot data=WORK.MYDATA;
    /*--Histogram settings--*/
    histogram DefaultedOnDay / scale=PERCENT;

    /*--Kernel Density plot settings--*/
    density DefaultedOnDay / type=Kernel;

    /*--Horizontal Axis--*/
    xaxis label="Day of Default (Finland)";
    title "Number of Days to Default (Finland)";

    /*--Vertical or Response Axis--*/
    yaxis grid;
    discretelegend "DENSITY" / location=inside position=topright across=1;
   
    where Country = 'FI';

proc sgplot data=WORK.MYDATA;
    /*--Histogram settings--*/
    histogram DefaultedOnDay / scale=PERCENT;

    /*--Kernel Density plot settings--*/
    density DefaultedOnDay / type=Kernel;

    /*--Horizontal Axis--*/
    xaxis label="Day of Default (Slovakia)";
    title "Number of Days to Default (Slovakia)";   

    /*--Vertical or Response Axis--*/
    yaxis grid;
    discretelegend "DENSITY" / location=inside position=topright across=1;
   
    where Country = 'SK'; 
run;