Introduction
Univariate Plots
Univariate Analysis
Bivariate Plots
Bivariate Analysis
Multivariate Plots
Multivariate Analysis
Final Plots and Summary
Reflection
This analysis explores information from the Prosper Loan dataset. Udacity’s description states that the provided csv file contains data for 113,937 loans. 81 distinct attributes are included for each loan.
According to their website: “Prosper was founded in 2005 as the first peer-to-peer lending marketplace in the United States. Since then, Prosper has facilitated more than $19 billion in loans to more than 1,160,000 people.” Prosper provides its partners with API access for the development of loan and investment software clients.
I selected this dataset because it is relevant to my career. Recently, I accepted a position as a software engineer at a financial services organization. Their specialty is lending. Exploring the data in this project may improve my general understanding of lending practices and the borrowers who use my applications.
The goal of the project is to analyze loan attributes and gain insights about the various loan characteristics. For this analysis, I will focus on a subset of the attributes and attempt to answer the following questions:
What borrower characteristics are of interest?
How does a borrower’s occupation or income affect their Prosper Rating?
Is there any correlation between other borrower and loan attributes?
First, let’s take a look at the data and evaluate single attributes of interest.
## [1] 113937
## [1] 81
## [1] "ListingKey" "ListingNumber"
## [3] "ListingCreationDate" "CreditGrade"
## [5] "Term" "LoanStatus"
## [7] "ClosedDate" "BorrowerAPR"
## [9] "BorrowerRate" "LenderYield"
## [11] "EstimatedEffectiveYield" "EstimatedLoss"
## [13] "EstimatedReturn" "ProsperRating..numeric."
## [15] "ProsperRating..Alpha." "ProsperScore"
## [17] "ListingCategory..numeric." "BorrowerState"
## [19] "Occupation" "EmploymentStatus"
## [21] "EmploymentStatusDuration" "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup" "GroupKey"
## [25] "DateCreditPulled" "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper" "FirstRecordedCreditLine"
## [29] "CurrentCreditLines" "OpenCreditLines"
## [31] "TotalCreditLinespast7years" "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment" "InquiriesLast6Months"
## [35] "TotalInquiries" "CurrentDelinquencies"
## [37] "AmountDelinquent" "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years" "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance" "BankcardUtilization"
## [43] "AvailableBankcardCredit" "TotalTrades"
## [45] "TradesNeverDelinquent..percentage." "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio" "IncomeRange"
## [49] "IncomeVerifiable" "StatedMonthlyIncome"
## [51] "LoanKey" "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled" "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate" "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed" "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing" "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber" "LoanMonthsSinceOrigination"
## [63] "LoanNumber" "LoanOriginalAmount"
## [65] "LoanOriginationDate" "LoanOriginationQuarter"
## [67] "MemberKey" "MonthlyLoanPayment"
## [69] "LP_CustomerPayments" "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees" "LP_ServiceFees"
## [73] "LP_CollectionFees" "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss" "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded" "Recommendations"
## [79] "InvestmentFromFriendsCount" "InvestmentFromFriendsAmount"
## [81] "Investors"
Borrower information like credit score and stated monthly income are of interest to me. I prepared the data for the analysis with the following actions:
A new variable, CreditScoreAverage, was calculated from the upper and lower credit scores for each loan.
CreditScoreAverage was limited to the standard FICO range, 300-850
StatedMonthlyIncome was modified by rounding the values to the nearest whole number.
Let’s drop the columns that we do not need for this exploration and review the summary again.
## [1] 112974
## [1] 13
## [1] "Term" "LoanStatus"
## [3] "ProsperScore" "BorrowerState"
## [5] "Occupation" "EmploymentStatus"
## [7] "EmploymentStatusDuration" "IsBorrowerHomeowner"
## [9] "DelinquenciesLast7Years" "PublicRecordsLast10Years"
## [11] "IncomeVerifiable" "StatedMonthlyIncome"
## [13] "CreditScoreAverage"
Much better. Now we can observe the structure of the modified dataset and review the data types and values.
## Term LoanStatus ProsperScore BorrowerState
## Min. :12.00 Length:112974 Min. : 1.000 Length:112974
## 1st Qu.:36.00 Class :character 1st Qu.: 4.000 Class :character
## Median :36.00 Mode :character Median : 6.000 Mode :character
## Mean :40.87 Mean : 5.945
## 3rd Qu.:36.00 3rd Qu.: 8.000
## Max. :60.00 Max. :11.000
## NA's :28260
## Occupation EmploymentStatus EmploymentStatusDuration
## Length:112974 Length:112974 Min. : 0.00
## Class :character Class :character 1st Qu.: 26.00
## Mode :character Mode :character Median : 67.00
## Mean : 96.08
## 3rd Qu.:137.00
## Max. :755.00
## NA's :6872
## IsBorrowerHomeowner DelinquenciesLast7Years PublicRecordsLast10Years
## Length:112974 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median : 0.000 Median : 0.000
## Mean : 4.166 Mean : 0.313
## 3rd Qu.: 3.000 3rd Qu.: 0.000
## Max. :99.000 Max. :38.000
## NA's :336 NA's :46
## IncomeVerifiable StatedMonthlyIncome CreditScoreAverage
## Length:112974 Min. : 0 Min. :369.5
## Class :character 1st Qu.: 3208 1st Qu.:669.5
## Mode :character Median : 4667 Median :689.5
## Mean : 5599 Mean :695.5
## 3rd Qu.: 6778 3rd Qu.:729.5
## Max. :1750003 Max. :849.5
##
## 'data.frame': 112974 obs. of 13 variables:
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : chr "Completed" "Current" "Completed" "Current" ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ BorrowerState : chr "CO" "CO" "GA" "GA" ...
## $ Occupation : chr "Other" "Professional" "Other" "Skilled Labor" ...
## $ EmploymentStatus : chr "Self-employed" "Employed" "Not available" "Employed" ...
## $ EmploymentStatusDuration: int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : chr "True" "False" "False" "True" ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years: int 0 1 0 0 0 0 0 1 0 0 ...
## $ IncomeVerifiable : chr "True" "True" "True" "True" ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ CreditScoreAverage : num 650 690 490 810 690 ...
We have enough information to proceed.
Due to the simplicity of individual variable plots, each selected loan attribute will be presented in a stream of consciousness exploration. Detailed commentary is provided in the Univariate Analysis section.
Let’s explore and plot the data!
## [1] 68
## [1] "" "Accountant/CPA"
## [3] "Administrative Assistant" "Analyst"
## [5] "Architect" "Attorney"
## [7] "Biologist" "Bus Driver"
## [9] "Car Dealer" "Chemist"
## [11] "Civil Service" "Clergy"
## [13] "Clerical" "Computer Programmer"
## [15] "Construction" "Dentist"
## [17] "Doctor" "Engineer - Chemical"
## [19] "Engineer - Electrical" "Engineer - Mechanical"
## [21] "Executive" "Fireman"
## [23] "Flight Attendant" "Food Service"
## [25] "Food Service Management" "Homemaker"
## [27] "Investor" "Judge"
## [29] "Laborer" "Landscaping"
## [31] "Medical Technician" "Military Enlisted"
## [33] "Military Officer" "Nurse (LPN)"
## [35] "Nurse (RN)" "Nurse's Aide"
## [37] "Other" "Pharmacist"
## [39] "Pilot - Private/Commercial" "Police Officer/Correction Officer"
## [41] "Postal Service" "Principal"
## [43] "Professional" "Professor"
## [45] "Psychologist" "Realtor"
## [47] "Religious" "Retail Management"
## [49] "Sales - Commission" "Sales - Retail"
## [51] "Scientist" "Skilled Labor"
## [53] "Social Worker" "Student - College Freshman"
## [55] "Student - College Graduate Student" "Student - College Junior"
## [57] "Student - College Senior" "Student - College Sophomore"
## [59] "Student - Community College" "Student - Technical School"
## [61] "Teacher" "Teacher's Aide"
## [63] "Tradesman - Carpenter" "Tradesman - Electrician"
## [65] "Tradesman - Mechanic" "Tradesman - Plumber"
## [67] "Truck Driver" "Waiter/Waitress"
There are 113,937 loans in the dataset with 81 unique characteristics. Although 13 loan attributes will be utilized for this analysis, the following data ranges (low to high) are relevant:
Prosper Scores: 1-11
Average Credit Scores: 9.5-889.5 (limited to 300-850)
The most common Prosper Scores are 4, 6, and 7.
Most borrower credit scores average somewhere between 690-730.
A significant number of borrowers in this dataset live in California.
There are 68 unique borrower occupations. 28617 occupations are categorized
as "Other". 3588 occupations are null.
Most borrowers are currently employed with a median duration of 67 months.
Most borrowers do not have public records that impact their credit.
A small number of borrower deliquinces (4) in the last 7 years is common.
The median loan term is 36 months.
Approximately half of the borrowers in the dataset are current homeowners.
The majority of loans in the dataset have verifiable sources of income.
For my analysis, the main features are the attributes of interest are those that tell us more about the borrowers.
What is their occupation? What is their credit score? Is there a correlation between these characteristics and the Prosper Score?
Here are the primary characteristics:
ProsperScore:
A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.
CreditScoreAverage:
Aggregate credit score taken from the highest and lowest scores
EmploymentStatus:
The employment status of the borrower at the time they posted the listing.
EmploymentStatusDuration:
The length in months of the employment status at the time the listing was created.
Occupation:
The Occupation selected by the Borrower at the time they created the listing.
The following secondary characteristics will be used to further explore the borrower and loan data:
BorrowerState:
The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
DelinquenciesLast7Years:
Number of delinquencies in the past 7 years at the time the credit profile was pulled.
IncomeVerifiable:
The borrower indicated they have the required documentation to support their income.
IsBorrowerHomeowner:
A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
LoanStatus:
The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
PublicRecordsLast10Years:
Number of public records in the past 10 years at the time the credit profile was pulled.
StatedMonthlyIncome:
The monthly income the borrower stated at the time the listing was created.
Term:
The length of the loan expressed in months.
Yes. I created an average credit score variable, CreditScoreAverage, by adding the Lower and Upper Credit Score values for each loan and dividing by two.
It highly unusual for any US citizen to attain a double-digit number of public records. The max number of public records for a borrower in this dataset is 38. That is a questionable outlier.
There is a minimum value of 9.5 for a credit score in the dataset. According to American Express, the two most commonly used credit scoring models, FICO and VantageScore, both rank credit scores on a scale from 300 to 850. I limited the dataset and excluded rows with scores outside of the 300-850 range. This is a small number of rows (less than 1% of the total number of rows), so their removal will not significantly impact the analysis.
I noticed a large number of rows with loan data from California residents. Exploring data from specific borrower states could skew the analysis so I will not continue to explore it.
Yes. In addition to the CreditScoreAverage subset, I dropped the columns that were not needed for my exploration. This reduced the amount of data that the compiler needs to process for each function. I also rounded all values for StatedMonthlyIncome to the nearest whole number.
In this section, we will evaluate two characteristics in each plot.
We can see that Prosper Scores are better for borrowers with higher average credit scores. I will use the Average Credit Score to explore more characteristics in the dataset.
Employment does not guarantee that the borrower will have a high credit score. For example, part-time employees are more likely to have low credit scores. Surprisingly, Retirees (who are not likely to receive employment income) appear to have a credit score advantage.
After reviewing the employment status plots, I became curious about the relationship between stated income and employment statuses.
Another surprise. Retirees do not report a monthly income. They may have other financial resources like retirement funds or pensions but that does not seem to be reflected in the monthly income values.
There appears to be a relationship between low incomes and low credit scores. The lowest average credit scores in the dataset are associated with borrowers who report incomes below $25000 USD annually.
High incomes are visible across the median. Incomes are slightly higher in the 800-850 credit score range, but they are not as high as I anticipated. I will explore this further in the Multivariate Analysis.
Most borrowers are within the mean for average credit scores regardless of employment history.
There is a weak relationship between borrowers with a shorter employment duration and lower credit scores. A slightly stronger relationship exists for borrowers with longer employment durations and higher credit scores.
These findings align with the results in the prior plots.
Without more information it is difficult to make an inference. Is a low credit score caused by past due payments and defaulted loans? Or did the past due payments cause the low credit score?
We can clearly identify a greater occurrence of lower credit scores for borrowers of loans which were canceled or defaulted. Credit scores for Current loans vary but they are close to the median.
Completed loans and loans in their final payment stage are more likely to have some scores that are above the median. This may indicate that successful completion of a full loan term shows that a borrower is less likely to pose a risk to lenders in the future.
Are borrowers more likely to default on loans with longer terms? Nope!
I decided to display two plots for each to observe the differences between two types of visualizations for the same data.
These plots show that borrowers with high credit scores have few, if any, delinquencies or public records. We can infer that these are two factors which contribute to higher Prosper scores.
It is odd that although 1 or 2 public records decrease the borrower’s Prosper score, this is not a consistent trend.
The relationships between unique borrower characteristics, Prosper scores, and credit scores can be difficult to identify. It is often easier to determine the factors which contribute to a low credit score than a high credit score.
The creditworthiness of a borrower is calculated using multiple factors. Although they are important, income and employment are not the most significant factors. Borrowers with varying occupations and salaries can achieve a high Prosper score.
Yes. I was surprised to see that Retirees are more likely to have better credit scores although they do not receive a regular income from employment.
The strongest relationship exists between the Prosper Score and Credit Score. There is a clear relationship. Higher credit scores are preferred by Prosper lenders. Borrowers with high credit scores are less likely to default on their loans which is a less risky investment for a lender.
In this section, we will evaluate multiple characteristics and analyze complex relationships.
This plot shows the positive, negative, and neutral correlations between the attributes that I chose to select. The output confirms the discoveries about Prosper Scores and credit scores that were made in the Bivariate Plots section. There is a very strong correlation between Prosper scores and credit scores.
These plots use the ggpairs function. I wanted to closely examine the Delinquency and Public Record correlations. The plots validate the output of the correlation matrix and provide visualizations of the data.
The next two plots expand upon the Bivariate analysis of monthly income and employment status. We can see that borrowers with a low credit score will not receive a high Prosper Score. Length of employment and monthly income cannot change this.
The last plot combines the Prosper Score, Credit Score, and Loan Status to show a colorful array of visualizations. The correlation summary confirms and further expands upon the prior facet wrap plots. Borrowers who are either current or previously completed their loan payments have better Prosper Scores and credit scores.
The correlation matrix was especially helpful. It highlighted both strong and weak relationships in a simple visualization.
As I noted in the Bivariate Plots analysis, credit scores are better for borrowers who successfully completed their loan terms. The multivariate plots confirmed this.
The point plots highlighted loans that do not have Prosper Scores (NA). The associated borrower’s credit scores and salaries are typically low. This was surprising. I did not explore the NA value closely before generating this plot and had not considered it as a point (pun intended) of interest. The chosen style of the graphic and the colors made the relationship easier to identify.
This simple boxplot is the core of my exploration. When I initially reviewed the Prosper Loans dataset, I noticed the Prosper Score and wondered how it was calculated. Boxplots graphically depict the symmetry of categorical and continuous data comparisons. It was a good choice for these attributes.
In this visualization, we can easily observe a strong relationship between the credit score and Prosper Score. If the borrower’s average credit score is high, the Prosper Score will be high as well. Both scores are crucial characteristics of Prosper’s risk scoring process.
Prosper Scores are derived from historical loan data collected after 2009. With additional analysis, we may be able to determine risks for borrowers with existing Prosper history vs. first time Prosper borrowers.
I was excited to discover the facet wrap feature of ggplot. Facet wraps create a sequential series of graphs. Each category can be explored as a unique property or in relation to other categories.
Plots of this kind are easier for reader to quickly absorb. We can see more information about the borrowers for each unique employment type and decide if the employment status positively or negatively impacts lending risk factors like credit scoring.
Retirees and self-employed borrowers have better credit scores than I expected. Income cannot eliminate the inherent risks of lending to borrowers with who frequently miss payments or acquire public records.
This matrix is simple but it explains so much about the data. This is a quick glance at all of the loan characteristics that I chose to explore and their correlations with one another.
Loan risk scoring and creditworthiness calculation are complex subjects. The Prosper Loans dataset presents a great opportunity to explore the impact of loan and credit scores from the perspective of a borrower.
It was somewhat disappointing to discover that borrower occupations and employment statuses are poorly categorized. A large number of occupation values are either blank or categorized as “Other” and “Professional”. That may indicate that the borrower’s occupation is not a key characteristic of a Prosper risk analysis.
Requiring occupation and employment data for each loan would be a useful improvement. That data may highlight additional insights from supporting loan characteristics like long-term credit score stability. It would be fun to explore these attributes if they are available in the future.
The stated income and employment status data was the most surprising to me. Further exploration of the existing data might reveal that the Prosper Scoring model differs for Retirees and unemployed borrowers vs. borrowers with common sources of income.
Overall, this project was interesting and I enjoyed the exploration. This is my first time programming with R. I learned a lot about the language. I also gained new insights about the unique world of lending. That will certainly benefit me in my day to day work.
RDocumentation Reference: https://www.rdocumentation.org/
ggplot2 Reference: https://ggplot2.tidyverse.org/reference/
diplyr Reference: https://dplyr.tidyverse.org/
ggcor Reference: https://briatte.github.io/ggcorr/
R Markdown Cookbook: https://bookdown.org/yihui/rmarkdown-cookbook/
Advanced R Style Guide: http://adv-r.had.co.nz/Style.html
RStudio Cheatsheets https://www.rstudio.com/resources/cheatsheets/
Cookbook for R - Colorblind-friendly Palette http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette
How to Read and Use Histograms in R: https://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/
Loading Data and Formatting in R: https://flowingdata.com/2015/02/18/loading-data-and-basic-formatting-in-r/
Quick R - Subsetting Data: https://www.statmethods.net/management/subset.html
Credit Score Information: https://www.americanexpress.com/en-us/credit-cards/credit-intel/credit-score-ranges/
Sample Diamonds Exploration Provided by Udacity: https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html
R Markdown Project Template provided by Udacity: https://video.udacity-data.com/topher/2017/February/58af99ac_projecttemplate/projecttemplate.rmd