In this whitepaper, we proposed an artificially intelligent real estate "mind" that can predict, forecast, and estimate property values, built on Artificial Neural Networks (ANN), Machine Learning, and Natural Language Processing (NLP).
A second goal was to use blockchain-like infrastructure to create a decentralized, open-source real estate MLS system, but that is the subject of a separate whitepaper.
In this whitepaper, buyers and sellers (mainly real estate investors deciding whether to buy, hold, or pass) make decisions based on these predictions about properties, value, opportunity, and neighborhoods. We plan to expand the analysis, with further performance tuning and optimization, to all of California, to cities that have or are proposed to have Google Fiber, and to Mexico. We also plan to experiment with and test integration with speech software (Google Now, Siri, Cortana, IBM Watson, Amazon Echo) in the future.
ANN, Machine Learning, and NLP are well suited to modeling complex systems like real estate, where motivations are determined by a combination of factors such as crime, schools, neighborhood, jobs, cost, budget, and even emotion.
Artificial Neural Networks (ANNs) such as the multilayer perceptron, together with machine learning methods such as SVM (support vector machine), LSSVM (least squares support vector machine), Linear Regression, M5 Model Trees, and NaïveNeighborhoods, are used to forecast real estate property values.
NLP is used to analyze unstructured data and noise.
Real Estate Value Forecasting based on Artificial Neural Networks,
Natural Language Processing
Peter Jamack, Darrick Sogabe, Darren Kempiners,
Shirity Priya, Douglass Brown, Olu Oyedipe.
San Diego © 2016
Abstract
In this paper, we propose an artificially intelligent real estate "mind" that can predict, forecast, and estimate property values, built on Artificial Neural Networks (ANN), Machine Learning, and Natural Language Processing (NLP).
Buyers and sellers will make decisions based on these predictions about properties, value, opportunity, and neighborhoods. We plan to expand the analysis, with further performance tuning and optimization, to all of California, to cities that have or are proposed to have Google Fiber, and to Mexico. We also plan to experiment with and test integration with speech software (Google Now, Siri, Cortana, IBM Watson, Amazon Echo) in the future.
ANN, Machine Learning, and NLP are well suited to modeling complex systems like real estate, where motivations are determined by a combination of factors such as crime, schools, neighborhood, jobs, cost, budget, and even emotion.
Artificial Neural Networks (ANNs) such as the multilayer perceptron, together with machine learning methods such as SVM (support vector machine), LSSVM (least squares support vector machine), Linear Regression, M5 Model Trees, and NaïveNeighborhoods, are used to forecast real estate property values.
NLP is used to analyze unstructured data and noise.
Introduction
The main aim of this paper is to define a real estate property forecasting system based on ANN, Machine Learning, and NLP. This system should be able to accurately predict real estate values, WACC, and LTV, and to outperform every supervised learning algorithm used on its own as well as the intuition and analysis of real estate agents and brokers. It would do this by integrating ANN and machine learning with NLP-based information retrieval that scours social media, the web, the dark web, open data, closed data, and other sources for both noise and non-noise information.
The proposed system could also help in simulating interactions, development and proposals where location choices for housing, schools or companies strongly depend on the real estate market.
The main input parameters of the proposed system are real estate pricing, sales, comparable sales, crime, neighborhood, schools, unemployment, transportation, construction costs, cash flow, rental market, economic and environmental quality related attributes.
The United States has three main real estate indexes: the National Council of Real Estate Investment Fiduciaries Property Index (NPI) for commercial real estate, and Radar Logic's RPX and the S&P Case-Shiller indices for residential real estate.
Artificial neural networks (ANN) are built from many neuron nodes and corresponding weights, forming an artificial system loosely modeled on biological neural networks. They handle nonlinear characteristics well, so an ANN can approximate nonlinear functions. However, accuracy can be low, and performance may depend on powerful GPU computing. For certain analyses, an ANN alone is not ideal. This is why combining machine learning algorithms with ANN is a better integrated solution.
The paper is organized as follows.
Section 1 is a literature overview. In Sections 2 and 3, the real estate and algorithmic models are presented. In Sections 4-6, the proposed Artificial Neural Network (ANN), Machine Learning, and NLP models are defined using datasets from San Diego. This paper experiments with the impact of key real estate attributes and neighborhood elements, including sales price, historical trends, unemployment, school ratings, crime statistics, comparable neighborhood prices, and economic factors. In Section 7, the results are discussed. In the last section (8), conclusions are drawn on the opportunity presented by ANN, Machine Learning, and NLP for real estate forecasting and estimating.
 Literature
Artificial Neural Networks (ANNs) have the ability to learn, generalize results, and respond adequately to incomplete or unknown data (Shaw, 1992). ANN methodology was developed to capture functional forms, allowing the uncovering of hidden nonlinear relationships between variables.
ANN represents a subfield of computer science concerned with the use of computers in tasks that are normally considered to require knowledge and cognitive abilities (Gevarter, 1985).
It has been applied to property price forecasting in recent years (Lai Piying, 2011). Borst (1991) defined a large number of variables in his network to appraise real estate in New York State, demonstrating that ANNs are able to predict the real estate price with 90% accuracy.
ANNs can perform better than linear multivariate analysis because they can model nonlinear relationships. They can also evaluate subjective information, such as schools, neighborhood, crime, unemployment, transportation, fun, and the characteristics of the environment, which is difficult to incorporate into traditional mathematical approaches.
SVM methods, which are founded on statistical learning theory, were developed in the 1990s and find a globally optimal solution by solving a quadratic programming problem. However, the complexity grows as the number of data points increases. SVM offers strong learning and generalization abilities and is used mainly for classification and regression problems.
LSSVM improves on SVM by replacing the inequality constraints in SVM with equality constraints. The original quadratic programming problem becomes the problem of solving a system of linear equations. LSSVM reduces parameter adjustments, reduces the complexity of the SVM calculation, and improves the efficiency of the calculation. However, LSSVM loses the sparseness of SVM.
Natural Language Processing (NLP) dates back to the 1950s, if not earlier. An early milestone was Alan Turing's paper "Computing Machinery and Intelligence," which proposed what became known as the "Turing Test." More recently, the field has focused on supervised and unsupervised learning methods.
 Real Estate Forecasting Model(s)
What is a reasonable price or return before even looking at properties?
One useful formula comes from the Weighted Average Cost of Capital (WACC).
The WACC takes leverage and risk into account to calculate the required equity return, r(e).
$$r(e) = \frac{r(p) - \mathrm{LTV} \times r(D)}{1 - \mathrm{LTV}}$$
LTV is the loan-to-value ratio of the mortgage
r(D) is the interest rate on the loan
r(p) is the real property return (7 to 8 percent on average)
So if you set LTV to 80%, the interest rate to 5%, and r(p) to 8%, the WACC equation calculates the required equity return, r(e), at 20%.
$$r(e) = \frac{0.08 - 0.80 \times 0.05}{1 - 0.80} = 0.20 \text{ (20 percent)}$$
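The required equity return calculation can be sketched in a few lines of Python. This is only an illustration of the formula in this section; the function and parameter names are assumptions introduced here, not part of the proposed system.

```python
def required_equity_return(property_return, ltv, loan_rate):
    """Required equity return r(e) = [r(p) - LTV * r(D)] / (1 - LTV)."""
    if not 0 <= ltv < 1:
        raise ValueError("LTV must be at least 0 and below 1")
    return (property_return - ltv * loan_rate) / (1 - ltv)


# Inputs from the worked example: r(p) = 8%, LTV = 80%, r(D) = 5%.
r_e = required_equity_return(property_return=0.08, ltv=0.80, loan_rate=0.05)
print(f"Required equity return: {r_e:.1%}")
```

Raising the LTV or lowering the loan rate in this sketch shows how sensitive the required equity return is to financing terms.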
We also have to remember that time costs money.
Present Value Formula (PVF, also known as present worth)
$$P = \frac{F}{(1+i)^n}$$
 P is the present value or worth of the object in question
 F is a future payment or cost
 i is the rate of return or discount rate
 n is the number of time periods (years or months) considered
As an example, what would $100K received in 5 years be worth today?
Here the PVF does not need an interest rate; it uses the yearly average inflation (2%) as the discount rate. The calculation is in years, so n is 5.
$$P = \frac{100{,}000}{(1+0.02)^5}$$
P = $90,573
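As a quick check of the present value example, the same arithmetic can be written as a small Python function. The function name is an assumption made for illustration.

```python
def present_value(future_value, rate, periods):
    """Discount a single future amount back to today: P = F / (1 + i)^n."""
    return future_value / (1 + rate) ** periods


# $100K received in 5 years, discounted at 2% yearly inflation.
pv = present_value(100_000, 0.02, 5)
print(f"${pv:,.0f}")  # $90,573
```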
The following expenses play an important role in the return and should therefore be added to the model to create a robust and safe estimate.
 Acquisition (Before Purchase)
 Property inspection
 Environmental inspection
 Closing costs at purchase (2-3%)
 Loan origination fee from lender
 Discount points on loan interest
 Credit report fees
 Appraisal fee (on 80 percent LTV value, not purchase price)
 Mortgage insurance application costs
 Mortgage broker fees
 Real estate broker/agent fees
 Real Estate taxes
 Repair & Renovation
 Flipping (Accounted for each month till sale)
 Mortgage payment
 Repairs & Remodeling costs
 Think minor painting, cleaning, landscaping
 Landscaping costs
 Utilities
 Insurance
 Real Estate brokers fee at sale
 Real estate taxes
 Sale
 Timing and Loss
 If it takes 3 months to sell
 If it takes 6 months, 12 months
Net Present Value (NPV) is the sum of the discounted future cash flows minus the purchase price. Take the time series of cash flows (expenses out and income in), discount each one, and add them up. This is the present value.
Subtract the purchase price and you get the Net Present Value.
$$\mathrm{NPV} = \sum_{n=1}^{t} \frac{F_n}{(1+i)^n} - \text{Project Costs}$$
It looks similar to the present value formula.
 F_n = expenses or cash flow for that particular period of time.
 (If under a year, use months)
 n = the index of each period of time (the 2nd month would be a 2)
 n = 1 for the starting period (when you first purchased the property)
 Expenses or other due diligence costs can be added to the purchase price to represent total project costs
 i = the opportunity cost of capital.
 Return expectations for the project
 Example
 For a 5-month project with a 20% yearly hurdle, multiply the hurdle by 5/12.
 If you wish to make the hurdle 20% overall (over the whole project)
 Divide the hurdle by the project months
 20% / 5 months = 4% per month
 t = the period in which the sale takes place
 Refine the model by adding mortgage costs
 Input monthly payments as expenses
 Subtract the payoff balance from the sale
NPV formula example
 Bought a $100K fixer-upper
 20% down ($20K)
 3% closing costs ($3K)
 $500 painting and landscaping costs
 $700 per month mortgage and other costs
 Sell it in 5 months for $135K
 Closing costs and expenses at sale add up to $6K

| Month | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Equity | ($20,000) | 0 | 0 | 0 | 0 |
| Expenses | ($4,200) | ($700) | ($700) | ($700) | ($86,700) |
| Income | 0 | 0 | 0 | 0 | $135,000 |
| Total | ($24,200) | ($700) | ($700) | ($700) | $48,300 |

Compute the NPV of the monthly totals; the result should be around $14,200.
It is a positive result, which means that the project's return exceeds the investment requirements.
A good rule of thumb is to consider projects that produce a zero or positive NPV. An NPV of zero means the project exactly meets your opportunity cost requirement. NPV translates into a return percentage through the internal rate of return (IRR).
IRR is the value of i, the opportunity cost of capital, that causes the NPV to equal zero. Remember, if your project is modeled month to month, the IRR is a monthly rate.
In our example, the IRR works out to a project return of 17% per month.
Remember this is still only a model of a potential project, so reality could change the return.
If the investment hurdle for the project was 20 percent over five months, you would check that the IRR exceeded 4 percent per month (20 percent divided by 5 months).
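The NPV and IRR checks for the example can be sketched as follows. This is a minimal illustration, assuming the monthly totals from the example (a net outflow of $24,200 in month 1, $700 in each of months 2 to 4, and a net inflow of $48,300 in month 5), a 4% monthly hurdle, and discounting that starts at n = 1. The function names and the bisection search are choices made here for illustration. The exact NPV depends on the discounting convention, so it need not match the rounded figure quoted above.

```python
def npv(rate, cash_flows):
    """Sum of discounted per-period cash flows; the first flow is period n = 1."""
    return sum(cf / (1 + rate) ** n for n, cf in enumerate(cash_flows, start=1))


def irr(cash_flows, low=0.0, high=1.0, tol=1e-9):
    """Rate per period at which NPV is zero, found by bisection.

    Assumes NPV is non-negative at `low` and non-positive at `high`.
    """
    if npv(low, cash_flows) < 0 or npv(high, cash_flows) > 0:
        raise ValueError("IRR is not bracketed by [low, high]")
    while high - low > tol:
        mid = (low + high) / 2
        if npv(mid, cash_flows) > 0:
            low = mid   # still profitable at this rate, so the IRR is higher
        else:
            high = mid
    return (low + high) / 2


monthly_totals = [-24_200, -700, -700, -700, 48_300]
hurdle_per_month = 0.20 / 5  # 20% over five months

project_npv = npv(hurdle_per_month, monthly_totals)
project_irr = irr(monthly_totals)
print(f"NPV at {hurdle_per_month:.0%} per month: ${project_npv:,.0f}")
print(f"IRR: {project_irr:.1%} per month")
print("Clears hurdle:", project_irr > hurdle_per_month)
```

Bisection is used here only because it is easy to verify by hand; a spreadsheet IRR function or a numerical library would serve equally well.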
The value of the NPV and IRR calculations shows when the time schedule changes. Stress testing or sensitivity testing are good things to add to the model. If the project could not be sold for 12 months, keeping all the other numbers the same would have a big impact. Mortgage payments and expenses would pile up, and the time value of money comes into play.
Running the NPV calculation again, you find an NPV of around $9,140. The project still exceeds our investment goal, but the NPV has dropped by $5,060 from the initial calculation of $14,200.
That is around a 36% loss in project value simply from extending the project by 7 months. The IRR has also dropped from the initial 90 percent to 53 percent.
Have a realistic schedule and stress test your models.
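A holding-period stress test can be sketched by rebuilding the cash flow series for different sale months. The sketch below makes several simplifying assumptions that are not from the paper: the monthly discount rate is held fixed at 4%, the carrying cost stays at $700 per month, and the net sale-month total stays at $48,300 regardless of when the sale happens. The function names are hypothetical, and the printed values are illustrative rather than a reproduction of the figures quoted above.

```python
def npv(rate, cash_flows):
    """Sum of discounted per-period cash flows; the first flow is period n = 1."""
    return sum(cf / (1 + rate) ** n for n, cf in enumerate(cash_flows, start=1))


def flip_cash_flows(months_to_sale, first_month=-24_200, carry=-700, sale_net=48_300):
    """Monthly totals: the purchase month, the carrying months, then the sale month."""
    if months_to_sale < 2:
        raise ValueError("the sale must happen after the purchase month")
    return [first_month] + [carry] * (months_to_sale - 2) + [sale_net]


monthly_rate = 0.04  # assumed hurdle per month, held fixed across scenarios
results = {}
for months in (5, 8, 12):
    results[months] = npv(monthly_rate, flip_cash_flows(months))
    print(f"Sale in month {months:>2}: NPV = ${results[months]:,.0f}")
```

Each extra month adds a carrying cost and pushes the sale proceeds further into the future, so the NPV falls as the schedule slips.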
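House Flipping Pro Forma

| Category | Month 1 | Month 2 | Month 3 | Month 4 | Month 5 |
| --- | --- | --- | --- | --- | --- |
| Acquisition | | | | | |
| Property | ($100,000) | | | | |
| Closing Costs | ($6,000) | | | | |
| SUBTOTAL | ($106,000) | | | | |
| Expenses | | | | | |
| Renovations / Repairs | | ($500) | | | |
| Mortgage | ($525) | ($525) | ($525) | ($525) | ($525) |
| Lawn Maintenance | ($75) | ($75) | ($75) | ($75) | ($75) |
| Utilities/Trash | ($100) | ($100) | ($100) | ($100) | ($100) |
| SUBTOTAL | ($700) | ($1,200) | ($700) | ($700) | ($700) |
| Sale | | | | | |
| Property Sale | | | | | $135,000 |
| Sales Costs | | | | | ($6,000) |
| CASH FLOW | ($106,700) | ($1,200) | ($700) | ($700) | $129,000 |

Project Returns

| Measure | Value |
| --- | --- |
| Profit | $19,700 |
| OCC | 20% |
| NPV | $1,100 |
| IRR | 20% |

EQUITY RETURNS

| Category | Month 1 | Month 2 | Month 3 | Month 4 | Month 5 |
| --- | --- | --- | --- | --- | --- |
| Equity Cash Flow | ($26,700) | ($1,200) | ($700) | ($700) | $129,000 |
| Mortgage Repayment | | | | | ($80,000) |
| Cash Flow to Equity | ($26,700) | ($1,200) | ($700) | ($700) | $49,000 |

Equity Returns

| Measure | Value |
| --- | --- |
| Profit | $19,700 |
| Req'd. Equity | ($29,300) |
| OCC | 20% |
| NPV | $11,700 |
| IRR | 70% |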

Leverage
Leverage is thought of as more of a Wall Street or finance feature, but it also exists in real estate and can offer higher returns than all-cash deals. If the project in the example were an all-cash deal, the cost would be $109,300.
In our example, the net profit is $19,700. Ignoring the time value of money for simplicity, the ratio of profit to equity is around 18 percent. The actual profit might be a bit higher, because an all-cash deal means no loan fees or mortgage payments.
With leverage, however, only $29,300 was put into the project and the rest was borrowed. So you are putting up $29,300 of equity on a $109,300 project. Assuming all other variables are the same, the net profit is the same as in the all-cash case, but the ratio of profit to equity is now 67 percent, which is a much higher return.
Leverage increases the project's profit potential by a factor of almost four. But leverage also comes with risks and can turn against you.
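The leverage comparison reduces to two ratios, shown here as a short Python calculation using the figures from the example ($19,700 profit, $109,300 all-cash outlay, $29,300 leveraged equity). The variable names are chosen for illustration, and the time value of money is ignored, as in the text.

```python
profit = 19_700
all_cash_outlay = 109_300   # total cost if no money is borrowed
leveraged_equity = 29_300   # cash actually invested when using the mortgage

all_cash_return = profit / all_cash_outlay
leveraged_return = profit / leveraged_equity

print(f"All-cash profit to equity:  {all_cash_return:.0%}")   # 18%
print(f"Leveraged profit to equity: {leveraged_return:.0%}")  # 67%
print(f"Leverage multiplier:        {leveraged_return / all_cash_return:.1f}x")  # 3.7x
```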
The pro forma can be thought of as a business plan for the project. Before the pro forma, you were estimating and planning with your model, and with that model you were not even looking at any physical properties.
Using this system means you build a model of a project that offers a realistic picture of what kind of project you should look for. So if $30K was all that could be invested in a project, the models show that you are limited to properties under $100K. You also learned that you would need to sell the property for $35K more than you paid for it (a 35% increase).
Building on this model, you need to narrow your search to neighborhoods with an average value of $135K and look for a sales price of $100K. The model saves you time and shows you what to look for, but you still need to run through a few stress tests and scenarios (using the model and the pro forma) in which it might take over a year to sell your property. You need to find out how long you have before you start losing money or use up any cash reserves. Also experiment with the renovation costs. Add contingency lines for expenses (3-5%) if you think that might help.
 Artificial Neural Network Models & Data Sets
The benefit of this system is that, with ANNs, there is no need to assume an explicit function or process between the inputs and outputs of the model, because ANNs learn directly from the observed data.
The ANN, machine learning, and NLP system used in this paper has been trained with data gathered from the County of San Diego, which represents an expensive real estate market with pockets of affordability.
In particular, San Diego is characterized as one of the worst cities in which to build wealth and one of the least affordable cities in the United States, and it has a smaller job market than Los Angeles, San Francisco, Seattle, or New York City.
California as a whole is also rated near the bottom in education and school spending per pupil. However, San Diego is rated as one of the best places to retire, to work remotely, and to lead a more laid-back lifestyle.
 Multilayer perceptron
MLP is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one.
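To make the layer structure concrete, here is a minimal sketch of an MLP forward pass in plain Python. It is not the network used in this paper. The layer sizes, the tanh activation, the random untrained weights, and the input values are all assumptions chosen for illustration.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: every input feeds every output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def mlp_forward(inputs, layers):
    """Feed inputs through each layer; tanh on hidden layers, linear output layer."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:
            activations = [math.tanh(a) for a in activations]
    return activations


def random_layer(n_in, n_out, rng):
    """Untrained layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
# 3 inputs -> 4 hidden nodes -> 1 output (for example, a scaled price estimate)
layers = [random_layer(3, 4, rng), random_layer(4, 1, rng)]
features = [0.6, 0.8, 0.3]  # hypothetical scaled inputs, e.g. school, crime, jobs
prediction = mlp_forward(features, layers)
print(prediction)
```

Training (adjusting the weights by backpropagation against observed sale prices) is deliberately left out; the sketch only shows how data flows through fully connected layers.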
 M5 Model Trees
M5P is a reconstruction of Quinlan's M5 algorithm for inducing trees of regression models and combines a conventional decision tree with the possibility of linear regression functions at the nodes.
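The idea of a tree with linear regression functions at its leaves can be sketched with a single split on a single feature. This is a simplified illustration, not the M5 or M5P algorithm itself: it omits pruning, smoothing, and multivariate leaf models, and the function names and toy data are assumptions made here. The split is chosen by standard deviation reduction, the criterion associated with M5.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, y_bar - slope * x_bar


def best_split(xs, ys, min_leaf=2):
    """Threshold that maximizes standard deviation reduction, or None."""
    pairs = sorted(zip(xs, ys))
    n, total_sd = len(pairs), pstdev(ys)
    best_threshold, best_gain = None, 0.0
    for i in range(min_leaf, n - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = total_sd - len(left) / n * pstdev(left) - len(right) / n * pstdev(right)
        if gain > best_gain:
            best_threshold, best_gain = (pairs[i - 1][0] + pairs[i][0]) / 2, gain
    return best_threshold


def fit_model_tree(xs, ys):
    """One-split model tree: (threshold, left linear model, right linear model)."""
    threshold = best_split(xs, ys)
    if threshold is None:
        return None, fit_line(xs, ys), None
    left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    return threshold, fit_line(*zip(*left)), fit_line(*zip(*right))


def predict(tree, x):
    threshold, left_model, right_model = tree
    slope, intercept = left_model if threshold is None or x <= threshold else right_model
    return slope * x + intercept


# Toy data with two linear regimes (hypothetical, for illustration only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 12, 14, 16, 40, 43, 46, 49]
tree = fit_model_tree(xs, ys)
print(tree[0], predict(tree, 3), predict(tree, 7))
```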
 Machine Learning Models
 SVM
SVM is mainly used for classification of samples into different categories and for regression. The classification problem amounts to finding a hyperplane in a higher-dimensional space that separates the samples of different categories.
For SVM, multi-class classification can be handled by constructing multiple two-class (binary) classifiers.
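The hyperplane decision rule, and one common way of combining two-class classifiers into a multi-class decision (one-vs-rest), can be sketched as follows. The sketch assumes linear classifiers whose weights are already known. The weights, class labels, and sample below are hand-picked hypothetical values, not trained SVMs, and the paper does not state which multi-class scheme it uses.

```python
def decision_value(weights, bias, x):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify_binary(weights, bias, x):
    """Two-class decision: +1 on or above the hyperplane, -1 below it."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


def classify_one_vs_rest(classifiers, x):
    """Pick the class whose binary classifier gives the largest score."""
    return max(classifiers, key=lambda label: decision_value(*classifiers[label], x))


# Hypothetical hyperplanes, one per class: label -> (weights, bias)
classifiers = {
    "undervalued": ([-1.0, 0.5], 0.0),
    "fair": ([0.0, -1.0], 0.5),
    "overvalued": ([1.0, 0.5], -0.5),
}
sample = [2.0, 1.0]  # hypothetical two-feature property description
print(classify_one_vs_rest(classifiers, sample))
```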
 K-Nearest Neighbors
 Linear Regression
 Partial Least Squares Regression
PLS finds the function that best matches the original data by minimizing the sum of squared errors. Even when the independent variables are strongly correlated with one another (multicollinearity), all of them are retained in the final PLS regression model, so that as much information as possible is extracted from the original data, which supports the accuracy of the model.
 Natural Language Processing Models
Modern NLP algorithms are based on machine learning, especially statistical machine learning. Many different classes of machine learning algorithms have been applied to NLP tasks. These algorithms take as input a large set of "features" that are generated from the input data. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature.
Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.
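A soft, probabilistic decision based on real-valued feature weights can be sketched with a tiny bag-of-words scorer followed by a softmax. The labels, words, and weights below are hypothetical and hand-set for illustration; a real system would learn them from data.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def classify(tokens, weights):
    """Score each label by summing the weights of the tokens it knows about."""
    scores = {label: sum(w.get(tok, 0.0) for tok in tokens)
              for label, w in weights.items()}
    return softmax(scores)


# Hypothetical real-valued weights attached to word features.
weights = {
    "positive": {"renovated": 1.2, "quiet": 0.8, "crime": -1.0},
    "negative": {"renovated": -0.4, "noisy": 0.9, "crime": 1.5},
}
probs = classify("newly renovated home on a quiet street".split(), weights)
for label, p in probs.items():
    print(f"{label}: {p:.2f}")
```

Because the output is a probability for every label rather than a single hard answer, a downstream component such as a price model can weigh the text signal by its certainty.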
 Results & Discussion
 Conclusion
References
