Introduction
Univariate Plots
Univariate Analysis
Bivariate Plots
Bivariate Analysis
Multivariate Plots
Multivariate Analysis
Final Plots and Summary
Reflection

Introduction

This analysis explores information from the Prosper Loan dataset. Udacity’s description states that the provided csv file contains data for 113,937 loans. 81 distinct attributes are included for each loan.

According to their website: “Prosper was founded in 2005 as the first peer-to-peer lending marketplace in the United States. Since then, Prosper has facilitated more than $19 billion in loans to more than 1,160,000 people.” Prosper provides its partners with API access for the development of loan and investment software clients.

I selected this dataset because it is relevant to my career. Recently, I accepted a position as a software engineer at a financial services organization. Their specialty is lending. Exploring the data in this project may improve my general understanding of lending practices and the borrowers who use my applications.

Project Description

The goal of the project is to analyze loan attributes and gain insights about the various loan characteristics. For this analysis, I will focus on a subset of the attributes and attempt to answer the following questions:

What borrower characteristics are of interest?
How does a borrower’s occupation or income affect their Prosper Rating?
Is there any correlation between other borrower and loan attributes?

Univariate Plots

First, let’s take a look at the data and evaluate single attributes of interest.

Number of Rows

## [1] 113937

Number of Columns

## [1] 81

List of Column Names

##  [1] "ListingKey"                          "ListingNumber"                      
##  [3] "ListingCreationDate"                 "CreditGrade"                        
##  [5] "Term"                                "LoanStatus"                         
##  [7] "ClosedDate"                          "BorrowerAPR"                        
##  [9] "BorrowerRate"                        "LenderYield"                        
## [11] "EstimatedEffectiveYield"             "EstimatedLoss"                      
## [13] "EstimatedReturn"                     "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."               "ProsperScore"                       
## [17] "ListingCategory..numeric."           "BorrowerState"                      
## [19] "Occupation"                          "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"            "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                    "GroupKey"                           
## [25] "DateCreditPulled"                    "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"               "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                  "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"          "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"         "InquiriesLast6Months"               
## [35] "TotalInquiries"                      "CurrentDelinquencies"               
## [37] "AmountDelinquent"                    "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"            "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"              "BankcardUtilization"                
## [43] "AvailableBankcardCredit"             "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage."  "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                   "IncomeRange"                        
## [49] "IncomeVerifiable"                    "StatedMonthlyIncome"                
## [51] "LoanKey"                             "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"          "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate" "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"            "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"         "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"       "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                          "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                 "LoanOriginationQuarter"             
## [67] "MemberKey"                           "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                 "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                  "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                   "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                 "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                       "Recommendations"                    
## [79] "InvestmentFromFriendsCount"          "InvestmentFromFriendsAmount"        
## [81] "Investors"

Borrower information like credit score and stated monthly income are of interest to me. I prepared the data for the analysis with the following actions:

A new variable, CreditScoreAverage, was calculated from the upper and lower credit scores for each loan.
CreditScoreAverage was limited to the standard FICO range, 300-850
StatedMonthlyIncome was modified by rounding the values to the nearest whole number.

Let’s drop the columns that we do not need for this exploration and review the summary again.

Number of Rows

## [1] 112974

New Number of Columns

## [1] 13

New List of Column Names

##  [1] "Term"                     "LoanStatus"              
##  [3] "ProsperScore"             "BorrowerState"           
##  [5] "Occupation"               "EmploymentStatus"        
##  [7] "EmploymentStatusDuration" "IsBorrowerHomeowner"     
##  [9] "DelinquenciesLast7Years"  "PublicRecordsLast10Years"
## [11] "IncomeVerifiable"         "StatedMonthlyIncome"     
## [13] "CreditScoreAverage"

Much better. Now we can observe the structure of the modified dataset and review the data types and values.

Data Summary & Structure

Loan Data Summary

##       Term        LoanStatus         ProsperScore    BorrowerState     
##  Min.   :12.00   Length:112974      Min.   : 1.000   Length:112974     
##  1st Qu.:36.00   Class :character   1st Qu.: 4.000   Class :character  
##  Median :36.00   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :40.87                      Mean   : 5.945                     
##  3rd Qu.:36.00                      3rd Qu.: 8.000                     
##  Max.   :60.00                      Max.   :11.000                     
##                                     NA's   :28260                      
##   Occupation        EmploymentStatus   EmploymentStatusDuration
##  Length:112974      Length:112974      Min.   :  0.00          
##  Class :character   Class :character   1st Qu.: 26.00          
##  Mode  :character   Mode  :character   Median : 67.00          
##                                        Mean   : 96.08          
##                                        3rd Qu.:137.00          
##                                        Max.   :755.00          
##                                        NA's   :6872            
##  IsBorrowerHomeowner DelinquenciesLast7Years PublicRecordsLast10Years
##  Length:112974       Min.   : 0.000          Min.   : 0.000          
##  Class :character    1st Qu.: 0.000          1st Qu.: 0.000          
##  Mode  :character    Median : 0.000          Median : 0.000          
##                      Mean   : 4.166          Mean   : 0.313          
##                      3rd Qu.: 3.000          3rd Qu.: 0.000          
##                      Max.   :99.000          Max.   :38.000          
##                      NA's   :336             NA's   :46              
##  IncomeVerifiable   StatedMonthlyIncome CreditScoreAverage
##  Length:112974      Min.   :      0     Min.   :369.5     
##  Class :character   1st Qu.:   3208     1st Qu.:669.5     
##  Mode  :character   Median :   4667     Median :689.5     
##                     Mean   :   5599     Mean   :695.5     
##                     3rd Qu.:   6778     3rd Qu.:729.5     
##                     Max.   :1750003     Max.   :849.5     
##

Loan Data Structure

## 'data.frame':    112974 obs. of  13 variables:
##  $ Term                    : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus              : chr  "Completed" "Current" "Completed" "Current" ...
##  $ ProsperScore            : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ BorrowerState           : chr  "CO" "CO" "GA" "GA" ...
##  $ Occupation              : chr  "Other" "Professional" "Other" "Skilled Labor" ...
##  $ EmploymentStatus        : chr  "Self-employed" "Employed" "Not available" "Employed" ...
##  $ EmploymentStatusDuration: int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner     : chr  "True" "False" "False" "True" ...
##  $ DelinquenciesLast7Years : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years: int  0 1 0 0 0 0 0 1 0 0 ...
##  $ IncomeVerifiable        : chr  "True" "True" "True" "True" ...
##  $ StatedMonthlyIncome     : num  3083 6125 2083 2875 9583 ...
##  $ CreditScoreAverage      : num  650 690 490 810 690 ...

We have enough information to proceed.

Due to the simplicity of individual variable plots, each selected loan attribute will be presented in a stream of consciousness exploration. Detailed commentary is provided in the Univariate Analysis section.

Let’s explore and plot the data!

Individual Variables

Number of Unique Occupations

## [1] 68

##  [1] ""                                   "Accountant/CPA"                    
##  [3] "Administrative Assistant"           "Analyst"                           
##  [5] "Architect"                          "Attorney"                          
##  [7] "Biologist"                          "Bus Driver"                        
##  [9] "Car Dealer"                         "Chemist"                           
## [11] "Civil Service"                      "Clergy"                            
## [13] "Clerical"                           "Computer Programmer"               
## [15] "Construction"                       "Dentist"                           
## [17] "Doctor"                             "Engineer - Chemical"               
## [19] "Engineer - Electrical"              "Engineer - Mechanical"             
## [21] "Executive"                          "Fireman"                           
## [23] "Flight Attendant"                   "Food Service"                      
## [25] "Food Service Management"            "Homemaker"                         
## [27] "Investor"                           "Judge"                             
## [29] "Laborer"                            "Landscaping"                       
## [31] "Medical Technician"                 "Military Enlisted"                 
## [33] "Military Officer"                   "Nurse (LPN)"                       
## [35] "Nurse (RN)"                         "Nurse's Aide"                      
## [37] "Other"                              "Pharmacist"                        
## [39] "Pilot - Private/Commercial"         "Police Officer/Correction Officer" 
## [41] "Postal Service"                     "Principal"                         
## [43] "Professional"                       "Professor"                         
## [45] "Psychologist"                       "Realtor"                           
## [47] "Religious"                          "Retail Management"                 
## [49] "Sales - Commission"                 "Sales - Retail"                    
## [51] "Scientist"                          "Skilled Labor"                     
## [53] "Social Worker"                      "Student - College Freshman"        
## [55] "Student - College Graduate Student" "Student - College Junior"          
## [57] "Student - College Senior"           "Student - College Sophomore"       
## [59] "Student - Community College"        "Student - Technical School"        
## [61] "Teacher"                            "Teacher's Aide"                    
## [63] "Tradesman - Carpenter"              "Tradesman - Electrician"           
## [65] "Tradesman - Mechanic"               "Tradesman - Plumber"               
## [67] "Truck Driver"                       "Waiter/Waitress"

Univariate Analysis

What is the structure of your dataset?

There are 113,937 loans in the dataset with 81 unique characteristics. Although 13 loan attributes will be utilized for this analysis, the following data ranges (low to high) are relevant:

Prosper Scores: 1-11
Average Credit Scores: 9.5-889.5 (limited to 300-850)

Details:

The most common Prosper Scores are 4, 6, and 7.

Most borrower credit scores average somewhere between 690-730. 

A significant number of borrowers in this dataset live in California.

There are 68 unique borrower occupations. 28617 occupations are categorized
as "Other". 3588 occupations are null.

Most borrowers are currently employed with a median duration of 67 months.

Most borrowers do not have public records that impact their credit. 

A small number of borrower deliquinces (4) in the last 7 years is common.

The median loan term is 36 months. 

Approximately half of the borrowers in the dataset are current homeowners.

The majority of loans in the dataset have verifiable sources of income.

What is/are the main feature(s) of interest in your dataset?

For my analysis, the main features are the attributes of interest are those that tell us more about the borrowers.

What is their occupation? What is their credit score? Is there a correlation between these characteristics and the Prosper Score?

Here are the primary characteristics:

ProsperScore: 
A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.

CreditScoreAverage: 
Aggregate credit score taken from the highest and lowest scores

EmploymentStatus: 
The employment status of the borrower at the time they posted the listing.

EmploymentStatusDuration: 
The length in months of the employment status at the time the listing was created.

Occupation: 
The Occupation selected by the Borrower at the time they created the listing.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The following secondary characteristics will be used to further explore the borrower and loan data:

BorrowerState: 
The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.

DelinquenciesLast7Years: 
Number of delinquencies in the past 7 years at the time the credit profile was pulled.

IncomeVerifiable: 
The borrower indicated they have the required documentation to support their income.

IsBorrowerHomeowner: 
A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.

LoanStatus: 
The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.

PublicRecordsLast10Years: 
Number of public records in the past 10 years at the time the credit profile was pulled.

StatedMonthlyIncome: 
The monthly income the borrower stated at the time the listing was created.

Term: 
The length of the loan expressed in months.

Did you create any new variables from existing variables in the dataset?

Yes. I created an average credit score variable, CreditScoreAverage, by adding the Lower and Upper Credit Score values for each loan and dividing by two.

Of the features you investigated, were there any unusual distributions?

It highly unusual for any US citizen to attain a double-digit number of public records. The max number of public records for a borrower in this dataset is 38. That is a questionable outlier.

There is a minimum value of 9.5 for a credit score in the dataset. According to American Express, the two most commonly used credit scoring models, FICO and VantageScore, both rank credit scores on a scale from 300 to 850. I limited the dataset and excluded rows with scores outside of the 300-850 range. This is a small number of rows (less than 1% of the total number of rows), so their removal will not significantly impact the analysis.

I noticed a large number of rows with loan data from California residents. Exploring data from specific borrower states could skew the analysis so I will not continue to explore it.

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes. In addition to the CreditScoreAverage subset, I dropped the columns that were not needed for my exploration. This reduced the amount of data that the compiler needs to process for each function. I also rounded all values for StatedMonthlyIncome to the nearest whole number.

Bivariate Plots

In this section, we will evaluate two characteristics in each plot.

We can see that Prosper Scores are better for borrowers with higher average credit scores. I will use the Average Credit Score to explore more characteristics in the dataset.

Employment does not guarantee that the borrower will have a high credit score. For example, part-time employees are more likely to have low credit scores. Surprisingly, Retirees (who are not likely to receive employment income) appear to have a credit score advantage.

After reviewing the employment status plots, I became curious about the relationship between stated income and employment statuses.

Another surprise. Retirees do not report a monthly income. They may have other financial resources like retirement funds or pensions but that does not seem to be reflected in the monthly income values.

There appears to be a relationship between low incomes and low credit scores. The lowest average credit scores in the dataset are associated with borrowers who report incomes below $25000 USD annually.

High incomes are visible across the median. Incomes are slightly higher in the 800-850 credit score range, but they are not as high as I anticipated. I will explore this further in the Multivariate Analysis.

Most borrowers are within the mean for average credit scores regardless of employment history.

There is a weak relationship between borrowers with a shorter employment duration and lower credit scores. A slightly stronger relationship exists for borrowers with longer employment durations and higher credit scores.

These findings align with the results in the prior plots.

Without more information it is difficult to make an inference. Is a low credit score caused by past due payments and defaulted loans? Or did the past due payments cause the low credit score?

We can clearly identify a greater occurrence of lower credit scores for borrowers of loans which were canceled or defaulted. Credit scores for Current loans vary but they are close to the median.

Completed loans and loans in their final payment stage are more likely to have some scores that are above the median. This may indicate that successful completion of a full loan term shows that a borrower is less likely to pose a risk to lenders in the future.

Are borrowers more likely to default on loans with longer terms? Nope!

I decided to display two plots for each to observe the differences between two types of visualizations for the same data.

These plots show that borrowers with high credit scores have few, if any, delinquencies or public records. We can infer that these are two factors which contribute to higher Prosper scores.

It is odd that although 1 or 2 public records decrease the borrower’s Prosper score, this is not a consistent trend.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The relationships between unique borrower characteristics, Prosper scores, and credit scores can be difficult to identify. It is often easier to determine the factors which contribute to a low credit score than a high credit score.

The creditworthiness of a borrower is calculated using multiple factors. Although they are important, income and employment are not the most significant factors. Borrowers with varying occupations and salaries can achieve a high Prosper score.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes. I was surprised to see that Retirees are more likely to have better credit scores although they do not receive a regular income from employment.

What was the strongest relationship you found?

The strongest relationship exists between the Prosper Score and Credit Score. There is a clear relationship. Higher credit scores are preferred by Prosper lenders. Borrowers with high credit scores are less likely to default on their loans which is a less risky investment for a lender.

Multivariate Plots Section

In this section, we will evaluate multiple characteristics and analyze complex relationships.

This plot shows the positive, negative, and neutral correlations between the attributes that I chose to select. The output confirms the discoveries about Prosper Scores and credit scores that were made in the Bivariate Plots section. There is a very strong correlation between Prosper scores and credit scores.

These plots use the ggpairs function. I wanted to closely examine the Delinquency and Public Record correlations. The plots validate the output of the correlation matrix and provide visualizations of the data.

The next two plots expand upon the Bivariate analysis of monthly income and employment status. We can see that borrowers with a low credit score will not receive a high Prosper Score. Length of employment and monthly income cannot change this.

The last plot combines the Prosper Score, Credit Score, and Loan Status to show a colorful array of visualizations. The correlation summary confirms and further expands upon the prior facet wrap plots. Borrowers who are either current or previously completed their loan payments have better Prosper Scores and credit scores.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The correlation matrix was especially helpful. It highlighted both strong and weak relationships in a simple visualization.

As I noted in the Bivariate Plots analysis, credit scores are better for borrowers who successfully completed their loan terms. The multivariate plots confirmed this.

Were there any interesting or surprising interactions between features?

The point plots highlighted loans that do not have Prosper Scores (NA). The associated borrower’s credit scores and salaries are typically low. This was surprising. I did not explore the NA value closely before generating this plot and had not considered it as a point (pun intended) of interest. The chosen style of the graphic and the colors made the relationship easier to identify.

Final Plots and Summary

Plot One

Description One

This simple boxplot is the core of my exploration. When I initially reviewed the Prosper Loans dataset, I noticed the Prosper Score and wondered how it was calculated. Boxplots graphically depict the symmetry of categorical and continuous data comparisons. It was a good choice for these attributes.

In this visualization, we can easily observe a strong relationship between the credit score and Prosper Score. If the borrower’s average credit score is high, the Prosper Score will be high as well. Both scores are crucial characteristics of Prosper’s risk scoring process.

Prosper Scores are derived from historical loan data collected after 2009. With additional analysis, we may be able to determine risks for borrowers with existing Prosper history vs. first time Prosper borrowers.

Plot Two

Description Two

I was excited to discover the facet wrap feature of ggplot. Facet wraps create a sequential series of graphs. Each category can be explored as a unique property or in relation to other categories.

Plots of this kind are easier for reader to quickly absorb. We can see more information about the borrowers for each unique employment type and decide if the employment status positively or negatively impacts lending risk factors like credit scoring.

Retirees and self-employed borrowers have better credit scores than I expected. Income cannot eliminate the inherent risks of lending to borrowers with who frequently miss payments or acquire public records.

Plot Three

Description Three

This matrix is simple but it explains so much about the data. This is a quick glance at all of the loan characteristics that I chose to explore and their correlations with one another.

Reflection

Loan risk scoring and creditworthiness calculation are complex subjects. The Prosper Loans dataset presents a great opportunity to explore the impact of loan and credit scores from the perspective of a borrower.

It was somewhat disappointing to discover that borrower occupations and employment statuses are poorly categorized. A large number of occupation values are either blank or categorized as “Other” and “Professional”. That may indicate that the borrower’s occupation is not a key characteristic of a Prosper risk analysis.

Requiring occupation and employment data for each loan would be a useful improvement. That data may highlight additional insights from supporting loan characteristics like long-term credit score stability. It would be fun to explore these attributes if they are available in the future.

The stated income and employment status data was the most surprising to me. Further exploration of the existing data might reveal that the Prosper Scoring model differs for Retirees and unemployed borrowers vs. borrowers with common sources of income.

Overall, this project was interesting and I enjoyed the exploration. This is my first time programming with R. I learned a lot about the language. I also gained new insights about the unique world of lending. That will certainly benefit me in my day to day work.