What is the relationship between household wealth and rooftop solar in Australia?

Around one in three Australian homes have installed rooftop solar photovoltaics. The relationship between rooftop solar and household income or wealth has been a focus of research in Australia and in other countries. Studies in Australia using survey data tend to conclude that rooftop solar uptake is positively associated with wealth, although not necessarily with income. We critique those studies and suggest that their conclusion on the relationship between wealth and solar installation is not robust. In fact, using data from customers’ electricity bills we find a negative relationship between wealth and rooftop solar. However, we do not think our models are sufficiently robust to reach a firm conclusion. A richer dataset is needed. Continued effort in this area will be valuable considering the possibility that distributed (behind-the-meter) storage will increasingly be paired with rooftop solar. This will potentially have a large effect on the recovery of shared network costs. If richer households are able to opt out of increasingly expensive grid-supplied electricity, regressive impacts may become significant.

Highlights
• The relationship between household wealth and rooftop solar is not well understood because it is difficult to model robustly using available data.
• Claims that wealth is positively related to solar uptake are not robust when data is segmented to take into account the differences in the barriers faced by owners and renters, and homes with and without shared roof space.
• Analysing data from customers’ bills we find that for owner-occupied detached houses wealth and solar uptake are negatively related, although this analysis is not sufficiently robust to reach a firm conclusion.


Introduction
Across the south and eastern states of Australia, about one in three houses and small businesses (2.6m) have now installed rooftop photovoltaics. In total these systems are capable of providing around 12 GW of power at their peak on a sunny day. 1 These solar homes produce around 19 TWh of electricity each year (about 10% of end-use electrical demand). In some regions, solar panels are installed on more than one in two eligible roofs. As far as we know, this level of distributed (behind-the-meter) solar installation is higher than in other large developed countries.
While rooftop solar was initially heavily subsidised by taxpayers and electricity consumers, these subsidies declined quickly when solar costs declined and grid-supplied electricity prices rose (Mountain & Szuster, 2015). In some regions, rooftop solar continues to receive significant public subsidy, albeit means tested. 2 While rooftop solar retains political and community support, energy regulators and some community groups have increasingly drawn attention to their equity concerns. 3 Their arguments have focused, mainly, on the claim that solar homes may in future impose costs on electricity networks that those solar homes, rather than all consumers, should pay for. However, some welfare advocates also contend that there is a positive relationship between wealth and rooftop solar. 4 Rooftop solar, and its continued policy support, has therefore become topical in the context of the broader debate on a "just transition" (i.e. socially progressive decarbonization policy). These issues are topical elsewhere too. For example, in the United States, Welton & Eisen (2019) suggest that the acceleration of equitable rooftop PV has an important role to play in clean energy justice. Similar arguments can be found with respect to the Global North (Carley & Konisky, 2020) and California (Lukanov & Krieger, 2019). Because Australia has the highest country-level residential rooftop PV uptake that we are aware of, the relationship between wealth and PV uptake is especially topical here. We replicate and critique the existing literature, which uses Australian Bureau of Statistics (ABS) survey data. The ABS survey data provide a rich source of information on households' socio-economic characteristics (such as net wealth and income) but limited information on household electricity consumption, solar production or prices. We also analyse a sample of electricity bills from 6,067 houses across the south and eastern states of Australia.
The bills provide a rich source of information on the prices households actually pay for grid purchases and receive for solar exported to the grid, and on the amount of electricity consumed (grid consumption and self-consumption). Unlike the ABS survey data, our electricity bill data allow us to adjust for heterogeneity in electricity prices and end-use consumption when assessing the relationship between wealth and solar uptake.
Unlike the existing studies, our analysis of bill data supports a conclusion that wealth and solar uptake are negatively associated. However, rigorous testing of our analysis, as well as of the literature that relies on ABS survey data, leads us to the view that strong conclusions on the relationship between wealth and solar uptake are not possible. The existing studies and ours suffer from omitted variables. Our contribution to the literature is in clarifying that the relationship between wealth and rooftop solar uptake remains uncertain in Australia, and in identifying the sort of data that is likely to support a confident conclusion.
Section 2 provides background covering relevant literature and descriptive data. Sections 3 and 4 present the methodology and results, respectively. Section 5 discusses the findings. Section 6 presents the main conclusions and policy implications.

Literature Review
There is an extensive literature on the relationship between income or wealth and solar uptake, in the United States in particular. Low-to-moderate income households are less likely to adopt rooftop PV than high-income households (Carley and Konisky, 2020). However, while solar adopters generally skew towards higher incomes, that trend continues to diminish over time, and solar adopter incomes vary considerably and encompass many low-to-moderate income households.
Demand-side explanations for adoption inequity include access to capital, renting rather than owning, building form, language barriers, race and ethnicity (Sunter et al., 2019). Supply-side explanations include income-targeted marketing. PV leasing and property-assessed financing have increased the diffusion of PV adoption among low and middle income households in existing markets and have driven more installations into previously underserved low-income communities.
The Australian literature on the relationship between wealth/income and solar is more limited and focussed mainly on establishing whether there is a statistically significant relationship between wealth/income and solar uptake, using survey and census data. Best et al. (2019b) apply logit models to ABS data from the Survey of Income and Housing (SIH) 2015-16 and the Household Energy Consumption (HEC) survey 2012 to identify economic, social and environmental factors that affect solar uptake (and the intention to install solar). They estimate solar uptake and the intention to install solar with different combinations of the various factors including wealth. They conclude that higher net wealth is generally associated with higher solar uptake (but not with the intention to install solar), and that income does not influence solar uptake but does affect the intention to install solar.
Across Best et al. (2021) and Best et al. (2019b), income is found to be not significant using the 2015-16 SIH, but significant (at the 10% level of significance) using the 2017-18 SIH. They recognise that their results may suffer from multicollinearity based on the "substantial correlations" between net wealth and other variables. Further critique and analysis of Best et al. (2019b) and Best et al. (2021) follows in Sections 3 and 4.
Best et al. (2019a) use a cross-sectional model to estimate, for each postcode in Australia, the proportion of households with solar installed and solar PV capacity. Using postcode level data, they match data on small-scale PV installations (Australian PV Institute, 2019) to ABS census data on number of dwellings, rented dwellings, flats and apartments, and household income (Australian Bureau of Statistics, 2018), as well as solar exposure data (Bureau of Meteorology, 2018) and additional individual taxation income data (Australian Taxation Office, 2018). They find that subsidies, location, whether the home is rented and whether the dwelling is a flat or apartment are significant drivers of solar uptake at the postcode level. Household and personal income are also significant drivers of solar uptake, except for homes with very low income (less than A$20,799 p/a). However, Best et al. (2019a) do not include net wealth as an explanatory variable but do include superannuation as a proxy for accumulated capital. They find the proxy measure for accumulated capital is not a significant driver of solar uptake, but is a significant driver of solar PV capacity.

Data and methodology
3.1 Revisiting Best et al. (2019b) and Best et al. (2021)
Best et al. (2021) draw on the ABS SIH 2015-16 data to examine how solar uptake varies with household wealth across Australia. The Survey of Income and Housing (SIH) is conducted biennially with around 15,000 households across Australia and collects rich socio-economic data. Best et al. (2021) do not segment the data by any cohort. Rather, they examine the proportion of all households that have solar across the ten net wealth deciles. However, as we will show, wealth is strongly correlated with building form and ownership. We therefore segment the ABS SIH data into different cohorts according to ownership and building form, and re-examine how solar uptake varies with wealth.
Best et al. (2019b) similarly do not separate the data into owned homes and rented homes but instead account for ownership by including a flag for rental properties as an explanatory variable in their econometric (logit) model. However, as we will show, renting is strongly correlated with wealth: rented homes dominate the lowest wealth deciles, and the converse holds for owned homes. Consequently the model suffers from multicollinearity. To address this, we replicate the logit model in Best et al. (2021) using the ABS SIH 2015-16 (Australian Bureau of Statistics, 2017) for all dwellings and then apply this model to segmented datasets based on ownership and building form. We compare the significance, sign and magnitude of the estimated coefficients for each model (as applied to all dwellings, owner-occupied, rentals, other dwellings and owner-occupied houses) to assess whether the findings presented by Best et al. (2021) are robust when data are appropriately segmented.

Our Data
Australian customer group CHOICE 5 consented to our use of processed electricity bill data obtained from 10,050 unique households. These data were extracted from original PDF-format monthly or quarterly electricity bills that household electricity consumers (many of whom were members of CHOICE) had uploaded to CHOICE's website between May and November 2018, in order to make use of an online price comparison service, "CHOICE TRANSFORMER". CHOICE TRANSFORMER offered repeated electricity price comparison and customer switching but required the payment of a subscription fee if available savings (by switching to cheaper offers) of more than $100 per year were found. Customers were recruited mainly through email advertising to CHOICE's members and supporters. All households were in the contestable retail electricity markets of New South Wales (NSW), Victoria (VIC), South-East Queensland (QLD) or South Australia (SA).
The bill data includes information on billing address, postcode, retailer, network service provider, state, tariff type, grid consumption in the billing period (kWh), estimated annual bill amount ($), volume of solar exported to the grid (kWh) and the solar feed-in price (cents per kWh).
The estimated annual bill calculated by CHOICE TRANSFORMER is based on the electricity prices and estimated annual consumption. We obtained data on the type of dwelling.
Households with solar are remunerated for their solar exports to the grid (at their feed-in rate) and also obtain value from their solar systems by using their solar-produced electricity to substitute for grid supply. To correctly estimate energy consumption, it is therefore necessary to estimate the total annual solar generation (this is not measured) and subtract the solar exports to find the residual amount of solar generation that is self-consumed. This is then added to the grid purchases to estimate the annual electricity consumed in solar homes. The methodology used to estimate gross solar production is explained in Mountain et al. (2020).
The daily fixed charge levied by the household electricity supplier is not available in the CHOICE household bill dataset. We estimate the annual fixed (supply) charges as follows: NSW: $365, VIC: $402, QLD: $0.90 and SA: $0.8, based on the approximate average of the Standing Offers in 2018.
To estimate the variable price (cents per kWh) paid for grid-supplied electricity, we take the estimated annual bill, add back the solar income (the product of the solar exports and the feed-in rate), deduct the supply charge, and divide the result by the estimated volume of grid purchases.
Our data allows us to account for the effect of electricity bills on households' likelihood of installing solar. To do this it is necessary to estimate the adjusted annual electricity bill (in other words, what the bill would be if the household did not have solar). This is done by multiplying the estimated annual consumption (after grossing up for the volume of self-consumed solar) by the variable price and adding back the supply charge.
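The bill adjustments described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the paper: the class and field names are ours, the example values are invented, and the gross solar generation is taken as a given input (its estimation is described in Mountain et al., 2020).

```python
from dataclasses import dataclass


@dataclass
class AnnualBill:
    """Annualised bill quantities for one household (field names are illustrative)."""
    annual_bill: float        # estimated annual bill, net of solar income ($)
    grid_kwh: float           # grid purchases (kWh per year)
    export_kwh: float         # solar exported to the grid (kWh per year)
    feed_in_c_per_kwh: float  # feed-in rate (cents per kWh)
    gross_solar_kwh: float    # estimated gross solar generation (kWh per year)
    supply_charge: float      # assumed annual fixed supply charge ($)


def self_consumed_kwh(b: AnnualBill) -> float:
    # Solar generation that is not exported is consumed on site.
    return b.gross_solar_kwh - b.export_kwh


def total_consumption_kwh(b: AnnualBill) -> float:
    # End-use consumption = grid purchases + self-consumed solar.
    return b.grid_kwh + self_consumed_kwh(b)


def variable_price_c_per_kwh(b: AnnualBill) -> float:
    # Add back solar income, deduct the supply charge, divide by grid purchases.
    solar_income = b.export_kwh * b.feed_in_c_per_kwh / 100.0
    energy_cost = b.annual_bill + solar_income - b.supply_charge
    return 100.0 * energy_cost / b.grid_kwh


def adjusted_annual_bill(b: AnnualBill) -> float:
    # What the bill would have been without solar.
    energy_cost = total_consumption_kwh(b) * variable_price_c_per_kwh(b) / 100.0
    return energy_cost + b.supply_charge


# Invented example values, for illustration only.
example = AnnualBill(annual_bill=1000.0, grid_kwh=3000.0, export_kwh=2000.0,
                     feed_in_c_per_kwh=10.0, gross_solar_kwh=3500.0,
                     supply_charge=365.0)
print(round(variable_price_c_per_kwh(example), 2),
      round(adjusted_annual_bill(example), 2))
```

For a home without solar, the export and generation fields are zero and the adjusted bill equals the observed bill.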
Each household's postcode is matched to the 2016 postcode-specific decile of the Index of Relative Socio-economic Advantage and Disadvantage (IRSAD). The IRSAD ranking is used as a proxy for household wealth.

Preliminary data analysis
Descriptive statistics on the sample of households used for this analysis are reported in Table 1. However, price and consumption show almost no association (-0.05).

Econometric model
Our goal is to quantify the relationship between the probability that a house has solar and the individual characteristics of the house. Our binary dependent variable ($y_i$) is 1 (house has solar) or 0 (house does not have solar). Logit and probit models are suitable for modelling a dichotomous dependent variable because they fit a nonlinear function to the data (an S-shaped curve, rather than a straight line, between outcomes 0 and 1), which better enables prediction of whether the dependent variable equals 0 or 1. The logit model assumes a logistic distribution of errors, and the probit model assumes a normal distribution of errors. Logit and probit models yield similar results; however, we prefer the probit model because it enables marginal analysis of changes in the probability a household has solar (as opposed to changes in the log of the odds ratio that the household has solar, as is the case for a logit model).
We model the probability a house has solar as $\Pr(y_i = 1 \mid x_i, \beta) = 1 - F(-x_i'\beta)$, where $F$ is a continuous, strictly increasing function that takes a real value and returns a value ranging from 0 to 1.
It follows that $\Pr(y_i = 0 \mid x_i, \beta) = F(-x_i'\beta)$. In a probit model, $F$ is the cumulative distribution function of the standard normal distribution, $\Phi$.
We estimate the parameters for $\Pr(y_i = 1 \mid x_i, \beta) = 1 - \Phi(-x_i'\beta) = \Phi(x_i'\beta)$ using the method of maximum likelihood. Using the CHOICE data set, the probit model to estimate the probability a house has solar installed is as follows:
$$\Pr(y_i = 1) = \Phi\left(\beta_0 + \beta_1\,\mathrm{Bill}_i + \beta_2\,\mathrm{QLD}_i + \beta_3\,\mathrm{SA}_i + \beta_4\,\mathrm{NSW}_i + \beta_5\,\mathrm{Wealth}_i\right) \qquad [1]$$
where:
$\Pr(y_i = 1)$ is the probability the house has solar.
$\mathrm{Bill}_i$ is the adjusted annual total bill (but for the installation of solar and including fixed annual charges), as described above.
$\mathrm{QLD}_i$ is a dummy variable for homes in QLD (takes values of 1 if dwelling is in QLD, 0 otherwise).
$\mathrm{SA}_i$ is a dummy variable for homes in SA (takes values of 1 if dwelling is in SA, 0 otherwise).
$\mathrm{NSW}_i$ is a dummy variable for homes in NSW (takes values of 1 if dwelling is in NSW, 0 otherwise).
$\mathrm{Wealth}_i$ refers to the ABS IRSAD (takes values from 1 to 10).
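For reference, the maximum-likelihood step can be written out explicitly. The expression below is the standard probit log-likelihood rather than anything specific to this paper; the notation is an assumption on our part, with $y_i$ the solar indicator for house $i$, $x_i$ the vector of explanatory variables (including a constant), $\beta$ the coefficient vector, $n$ the number of houses and $\Phi$ the standard normal cumulative distribution function.

```latex
\ell(\beta) = \sum_{i=1}^{n} \Big[ y_i \ln \Phi(x_i'\beta)
            + (1 - y_i) \ln\big(1 - \Phi(x_i'\beta)\big) \Big],
\qquad
\hat{\beta} = \arg\max_{\beta} \, \ell(\beta).
```

Houses with solar contribute $\ln \Phi(x_i'\beta)$ and houses without solar contribute $\ln(1 - \Phi(x_i'\beta))$, so the estimate is the coefficient vector that makes the observed pattern of uptake most probable.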
We assess the ability of the model to fit our sample data and to provide reliable predictions on the relationship between wealth and solar uptake using a variety of methods. "Correct" classifications are obtained when the predicted probability is less than or equal to 0.5 and the observed y=0, or when the predicted probability is greater than 0.5 and the observed y=1. The weighted average of the percentage of times the model correctly predicts y=1 and y=0 is used to measure the overall proportion of times the model correctly predicts whether a house has solar installed. The expectation-prediction results for the informed model (our estimated probit model) and the uninformed (constant probability) model are then compared to calculate the percentage increase in prediction accuracy. However, an important limitation of the expectation-prediction accuracy is that it relates to the ability of the model to predict all observations in the overall sample (where y=1 and y=0, based on the weighted average of correctly predicted outcomes) and does not separately measure the ability of the model to predict whether a home has solar or does not have solar. As a result, in samples (such as ours) where the data are heavily skewed towards homes that do not have solar, the measure of overall accuracy is misleading. We therefore provide the overall weighted measure of the prediction-evaluation tests, but it is the accuracy of the model in predicting the subsets of homes with and without solar that is most relevant.
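The classification rule described above can be illustrated with a short sketch. The function names and the toy data are ours, and the sketch assumes both outcomes (solar and no solar) occur in the sample; it is not the software routine used to produce the reported tables.

```python
def classification_table(y_true, p_hat, cutoff=0.5):
    """Share of correct predictions for y=0, for y=1, and overall."""
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for y, p in zip(y_true, p_hat):
        predicted = 1 if p > cutoff else 0  # p <= cutoff is classified as y=0
        total[y] += 1
        correct[y] += int(predicted == y)
    n = total[0] + total[1]
    return {
        "correct_no_solar": correct[0] / total[0],
        "correct_solar": correct[1] / total[1],
        "correct_overall": (correct[0] + correct[1]) / n,
    }


def constant_probability_table(y_true, cutoff=0.5):
    """Uninformed benchmark: every house gets the sample share of solar homes."""
    share = sum(y_true) / len(y_true)
    return classification_table(y_true, [share] * len(y_true), cutoff)


# Toy sample: 8 homes without solar, 2 with solar (invented probabilities).
y = [0] * 8 + [1] * 2
p = [0.1] * 7 + [0.6] + [0.7, 0.3]
informed = classification_table(y, p)
uninformed = constant_probability_table(y)
gain = informed["correct_overall"] - uninformed["correct_overall"]
print(informed, uninformed, gain)
```

In this toy sample the informed model identifies half of the solar homes yet shows no gain in overall accuracy over the constant-probability benchmark, which is the limitation of the overall measure noted above.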
We also perform two goodness-of-fit tests of the model predictions using the Andrews (1988) and Hosmer-Lemeshow (1989) tests. The Hosmer-Lemeshow test groups observations into deciles of predicted probability and compares the observed and predicted probability of solar uptake in each group. If the difference between the observed and predicted probability of solar uptake across each decile is large, the model provides an insufficient fit of the data and should be rejected (the null hypothesis is that the model is correctly specified, the test statistic follows a $\chi^2$ distribution with $g-2$ degrees of freedom, and $g$ is the number of groupings (10)). The Andrews test performs a similar test to assess goodness of fit of the model predictions (using a similar data grouping structure). We report the p-values for both tests (a p-value at or below 0.10, 0.05 and 0.01 indicates the null hypothesis (the model is correctly specified) should be rejected at the 10, 5 and 1% level of significance, respectively).
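A minimal sketch of the Hosmer-Lemeshow statistic follows, to make the grouping logic concrete. It is an illustration under our own assumptions, not the routine used for the reported p-values: the function names are ours, groups are formed as equal-sized bins of observations ordered by predicted probability, and the chi-squared tail probability uses a closed form that is exact only for an even number of degrees of freedom (which holds for g = 10, giving 8 degrees of freedom).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))


def hosmer_lemeshow(y, p_hat, groups=10):
    """Return (statistic, p-value) comparing observed and expected counts."""
    pairs = sorted(zip(p_hat, y))  # order observations by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        expected = sum(p for p, _ in chunk)      # expected number with solar
        observed = sum(obs for _, obs in chunk)  # observed number with solar
        mean_p = expected / n_g                  # assumed strictly between 0 and 1
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    return stat, chi2_sf_even_df(stat, groups - 2)
```

A small p-value indicates that observed and predicted uptake diverge across the groups, so the null hypothesis of correct specification is rejected.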

Marginal wealth effects
Marginal wealth effects refer to the incremental change in the probability a home has solar when the household's wealth decile ranking changes by one. We evaluate marginal effects at the median adjusted bill (see Table 1). For robustness, we also examine marginal effects at the 20th and 80th percentile adjusted bill ($1,238 and $3,884, respectively).
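The calculation can be sketched as the difference between two predicted probabilities from a probit index. The coefficient values below are hypothetical placeholders chosen only so the example runs; they are not the estimates reported in Table 3, and treating any state without a dummy as the base state is likewise an assumption of this sketch.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function

# Hypothetical coefficients, for illustration only (NOT the Table 3 estimates).
COEF = {"const": -1.0, "bill": 0.0002, "wealth": -0.03}
STATE_DUMMY = {"QLD": 0.4, "SA": 0.3, "NSW": 0.1}  # any other state is the base


def prob_solar(bill, state, wealth_decile):
    """Predicted probability of solar from the probit index."""
    index = (COEF["const"] + COEF["bill"] * bill
             + STATE_DUMMY.get(state, 0.0)
             + COEF["wealth"] * wealth_decile)
    return PHI(index)


def marginal_wealth_effect(bill, state, wealth_decile):
    """Change in predicted probability when the wealth decile rises by one."""
    return (prob_solar(bill, state, wealth_decile + 1)
            - prob_solar(bill, state, wealth_decile))


for bill in (1238, 3884):  # 20th and 80th percentile adjusted bills from the text
    effects = [marginal_wealth_effect(bill, "QLD", d) for d in range(1, 10)]
    print(bill, [round(e, 4) for e in effects])
```

Because the probit function is nonlinear, the marginal effect depends on where it is evaluated, which is why the text reports it at several adjusted bill levels and separately by state.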

Out-of-sample predictive accuracy
We investigate whether our probit model is effective at predicting solar uptake on unseen data. We also examine how the performance of our model compares to a logit model, as well as non-linear, tree-based models (Rahman & Fazle, 2011). Both the ABS SIH 2015-16 and CHOICE data sets are randomly partitioned into a training set consisting of 75% of the data and a test set consisting of the remaining 25%. Our models are trained on the training set, used to predict solar uptake for the test set, and scored with reference to the known test set.
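The random 75%/25% partition can be sketched as follows. This is a generic illustration using the Python standard library; the function name, the fixed seed and the rounding rule are our assumptions, and the paper does not state how its partition was implemented.

```python
import random


def train_test_partition(rows, train_share=0.75, seed=0):
    """Randomly partition rows into a training set and a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)      # copy, so the caller's data are left untouched
    rng.shuffle(shuffled)
    cut = int(round(train_share * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


households = list(range(100))  # stand-in identifiers for households
train, test = train_test_partition(households)
print(len(train), len(test))
```

The models are then fitted on the training rows only and scored on the held-out test rows, so the score reflects performance on data the model has not seen.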
Our comparison models are a decision tree, a random forest and a gradient boosted tree; all of which are commonly used in machine learning applications. A decision tree model (Quinlan, 1986) outputs a binary tree allowing an unseen data point to be classified by following a path from the root vertex of the tree to a leaf (a vertex of order one). A random forest (Ho, 1995) uses a collection of separately trained decision trees and classifies an unseen data point by choosing the class selected by most trees.
Gradient tree boosting (Schapire, 2003) is another way of combining several trees to form a single classifier. Random forests are able to avoid the problem of overfitting that a single decision tree sometimes suffers from. In turn, gradient tree boosting (through the use of lightgbm) typically produces a higher accuracy model than a random forest when trained on the same data. More information on these techniques can be found in the Appendix.
We fit a decision tree, a random forest, and a logit model to the training sets using scikit-learn, an open source machine learning library available for the Python programming language. We fit a gradient boosted model to the training sets using lightgbm, a gradient boosting framework available for Python that uses tree-based learning algorithms (Ke et al., 2017). We fit a probit model to the training sets using statsmodels, an open source statistical modelling library available for Python.
For each model, we calculate Accuracy (the percentage of test cases the model classifies correctly) and Balanced Accuracy (the arithmetic mean of the class specific accuracies) (Liu et al., 2014). If the model performs equally well on either class, then the accuracy and the balanced accuracy will be equal. We then define a robust classification model as being one that achieves a balanced accuracy of 70% or more.
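The two metrics can be written out directly. The sketch below uses our own function names and an invented test set in which one in five homes has solar; it illustrates the definitions above and is not the scoring code used for Tables 4 and 5.

```python
def accuracy(y_true, y_pred):
    """Share of test cases classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of the class-specific accuracies."""
    per_class = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        per_class.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(per_class) / len(per_class)


def is_robust(y_true, y_pred, threshold=0.70):
    """Robustness criterion used in the text: balanced accuracy of 70% or more."""
    return balanced_accuracy(y_true, y_pred) >= threshold


# Invented test set: 80 homes without solar, 20 with solar.
y_test = [0] * 80 + [1] * 20
always_no_solar = [0] * 100  # a classifier that never predicts solar
print(accuracy(y_test, always_no_solar),
      balanced_accuracy(y_test, always_no_solar),
      is_robust(y_test, always_no_solar))
```

A classifier that never predicts solar is right for every home without solar and wrong for every home with solar, so its accuracy equals the majority share while its balanced accuracy is one half.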
Since around one in five households in our sample have rooftop solar installed, and since our training set and test set were selected randomly, this imbalance in the number of data points in each class is reflected there too. A training set consisting of different numbers of representatives from each class may result in a model that is biased towards the majority class. As an extreme example, our model might assign every test case to the majority class and thereby achieve an accuracy equal to the percentage of test cases belonging to the majority class. In other words, an imbalanced training set may lead to inflated accuracy scores. To remedy this, we employ two techniques:
1. Oversampling the minority class in the training set using SMOTE (Synthetic Minority Oversampling Technique).
2. Choosing the balanced accuracy rather than the accuracy as our performance metric. This ensures that the performance metric values accuracy at predicting the majority and the minority classes equally (Classification - Training a Decision Tree against Unbalanced Data - Cross Validated, n.d.).

Revisiting Best et al. (2019b) and Best et al. (2021)
In Figure 1, we use the ABS SIH 2017-18 to replicate the results of Best et al. (2021).
Consistent with the results of Best et al. (2021), wealth appears to be positively associated with solar uptake. However, examining the distribution of solar uptake for all dwellings across wealth deciles does not allow us to isolate the impact of wealth on solar uptake holding the interrelationship between ownership, building form and wealth constant. We now proceed to segment the SIH data and examine how solar uptake, building form and ownership vary with wealth.
The ABS SIH 2017-18 data shows that the proportion of owned homes with solar is much higher than the proportion of rented homes with solar (24% of owned homes have solar compared to 3% of rented homes (Australian Bureau of Statistics, 2019)). Figure 4 shows that when ownership is taken into account, the proportion of homes with solar does not vary markedly across wealth deciles (excluding the least wealthy homes in the lowest two deciles).
This difference in solar uptake among renters and owners is likely to be explained by transaction costs (high density rented properties are likely to require special arrangements to allocate solar production on a shared roof to individual renters); property form (rented properties are typically higher density and so have less roof space in relation to floor space); and property rights (solar installation requires landlord approval and system ownership may be assigned to the landlord). Low solar adoption amongst renters may also result from "split incentives" (i.e. the proposition that landlords cannot recover the cost of the solar system in higher rents). The notion of split incentives is however contested: Wood et al. (2012) found no evidence of split incentives, while Zander (2020) found the contrary.
However, there is no obvious trend in the proportion of solar households among flats, units or apartments across wealth deciles. If we further segment the ABS SIH 2017-18 data into fully detached and owner occupied houses (Figure 6), we clearly see that wealth is strongly and positively associated with the likelihood of being an owner occupier of a fully detached house. Importantly, when we segment the data by ownership and building form, there is no observable relation between wealth and solar.
In the first column of Table 2, we replicate Best et al. (2019b) for all dwellings (17,429). 7 Columns 2 and 3 present the results for owner-occupied dwellings (12,417) and rented dwellings (5,012). As the sample of rented dwellings is biased towards fully detached houses (60%), we also replicate Best et al. (2019b) for "other" dwellings (i.e. dwellings other than fully detached houses); this cohort is dominated by renters (61%). The results for "other" dwellings are presented in column 4. In the final column, we replicate Best et al. (2019b) for owner-occupied houses (10,792).
The model applied to all dwellings (first column in Table 2) produces the second highest pseudo R² (0.14) and correctly estimates 83% of observations overall (comprising a 99% correct prediction of the 14,256 homes without solar and 5% correct prediction of the 3,039 homes with solar). The model also satisfies the Andrews and Hosmer-Lemeshow tests (p-values are above 0.100) indicating the model is correctly specified.
The estimated coefficients for solar uptake by owner-occupied dwellings are almost identical to those for all dwellings (second column in Table 2). However, the model does not perform as well when applied to owner-occupied dwellings only: the pseudo R² falls by half (down from 0.14 to 0.07), the Andrews and Hosmer-Lemeshow tests find the model is not correctly specified, and the overall proportion of correctly predicted outcomes falls to 77% (comprising 99% of correctly predicted homes without solar and 6% correctly predicted homes with solar).
The estimated coefficients for rented dwellings only (third column in Table 2) are similar for wealth, income, building form, and number of bedrooms. However, occupancy (persons, employed persons, number of dependent children), pension income and number of credit cards are no longer found to be statistically significant. As was the case with owner occupied dwellings, the model when applied to rented dwellings does not perform as well: the pseudo R² falls to 0.08, the model satisfies the Hosmer-Lemeshow test but fails the Andrews test (meaning we cannot be certain the model is correctly specified) and even though the expectation-prediction increases to 97% overall, the model does not accurately predict any rental home with solar within the sample (all of the 4,849 houses without solar are accurately predicted, whereas none of the 163 houses with solar are correctly predicted).
The model applied to other dwellings that are predominantly rented (fourth column in Table 2) produces substantively different results. Importantly, the estimated coefficient on the wealth variable is not statistically significant. Although the model produces a markedly higher pseudo R² (0.21) and is considered a good fit of the data, the model fails the Andrews test (suggesting the model is not correctly specified).
7 We replicate model (5) from Table 2 (Best et al., 2019b) as this model is less likely to suffer from multicollinearity (as there is no net wealth (log) squared term) and provides the second highest pseudo R². 2015-16 ABS SIH microdata are filtered in line with the methodology described in (Best et al., 2019b; Best et al., 2021).
The model applied to owner-occupied houses (fifth column) indicates wealth is a positive and statistically significant driver of solar uptake. However, the magnitude of the effect is estimated to be around 25% lower than the wealth effect estimated by the original model (first column). Furthermore, the model is an extremely poor fit of the data (pseudo R² 0.06) and the Andrews and Hosmer-Lemeshow tests both indicate the model is not correctly specified.
In summary, when the data are segmented to take into account ownership and/or building form, the Best et al. (2019b) model produces different results and the positive wealth effect does not necessarily hold. Furthermore, there is strong evidence to suggest the model does not provide a reasonable fit of the data and is not correctly specified. On this basis, we suggest the data and model used by Best et al. (2019b) do not produce sufficiently robust results to conclude there is a positive wealth effect.
Notes: ***, ** and * indicate statistical significance at the 1, 5 and 10 per cent level, respectively. P-values are reported for the Hosmer-Lemeshow and Andrews tests.

Our estimation results
The estimation results and model diagnostics for equation [1] using the CHOICE data are reported in Table 3. Adjusted bill is statistically significant and positively related to the likelihood solar is installed. McFadden's R-squared is low, suggesting the model does not provide a good fit of the data. This score is comparable to the results produced by Best et al. (2021) using a similar number of explanatory variables. The model satisfies the Andrews and Hosmer-Lemeshow tests, indicating the model is correctly specified. The classification table of the expectation-prediction evaluation tests (where the probability success cutoff = 0.5) shows that our model correctly estimates 81% of the observations overall (comprising 99% of the 4,893 homes without solar and 3% of the 1,117 homes with solar). We reject the null hypothesis that wealth is a redundant variable using the LR test, confirming the relevance of wealth as a driver of solar uptake.

Marginal wealth effects
On the basis of the estimated coefficients from equation [1], we estimate the probability a house has solar (assuming the house pays the 20th, 50th and 80th percentile adjusted bill, and separately for each state by setting the relevant state flags to one). As Figure 7 shows, there is a strong negative relationship between household wealth and solar uptake for houses that pay the median adjusted bill (and this relationship holds for the 20th and 80th percentile adjusted bills also). 8 Despite differences in the probability of solar uptake across states (solar uptake is highest in QLD and lowest in VIC), the impact of wealth on solar uptake (as measured by the gradient of the line) is similar across all states (the correlation coefficient between wealth decile and the probability a house has solar is -1.000, -0.997, -1.000 and -0.998 for QLD, VIC, SA and NSW, respectively).
Table 4 compares out-of-sample classification performance for the decision tree, random forest and gradient boosted tree with the performance of a logit and probit model using the CHOICE data set for both the original training data and the oversampled training data.
Table 5 compares out-of-sample classification performance for the decision tree, random forest and gradient boosted tree with the performance of a logit and probit model using the ABS SIH data set, filtered to include only owner occupied homes, for both the original training data and the oversampled training data. When used to train classification models, both the CHOICE data set and the ABS SIH data set used by Best et al. (2019b) lead to balanced accuracy scores that do not indicate a robust classifier. We reiterate that these results do not contradict our assessment of the goodness-of-fit of specific generalized linear models; a model can fit the available data well but be a poor classifier of unseen data.
For the ABS SIH data set, the situation in the case of the logit model is stark: both standard and oversampled models perform the same as a naïve classification model would (one classifying all unseen data as not having solar panels). Clearly, therefore, we should be reluctant to draw any firm conclusions on variable association from such a model.
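To illustrate why a naïve model scores 50% on balanced accuracy whatever the class mix, here is a minimal sketch. It assumes the usual definition of balanced accuracy as the mean of the per-class recall; the data and the helper function are hypothetical and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on each class (0 = no solar, 1 = solar)."""
    recalls = []
    for label in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical unseen data: 80 homes without solar, 20 with solar.
y_true = [0] * 80 + [1] * 20
naive_pred = [0] * len(y_true)  # naive model: never predicts solar

plain_accuracy = sum(t == p for t, p in zip(y_true, naive_pred)) / len(y_true)
print(plain_accuracy)                         # 0.8
print(balanced_accuracy(y_true, naive_pred))  # 0.5
```

Plain accuracy rewards the naïve model for the size of the majority class, whereas balanced accuracy gives it no credit for the solar homes it never identifies.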
We note that, in the case of the CHOICE data set, the (oversampled) probit model classifies unseen data best. The balanced accuracy score (63%), while falling short of the threshold for a robust classifier (above 70%), is higher than the 50% achieved from the ABS SIH data. Nevertheless, we conclude that drawing categorical conclusions from the generalized linear models fitted to the CHOICE data may not be appropriate.
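The text does not state which oversampling method was used. As one possibility, the sketch below assumes simple random oversampling, in which minority-class rows of the training set are duplicated at random until the two classes are the same size. The function name and the data are hypothetical.

```python
import random

def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are equal in size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    minority = min(by_class, key=lambda k: len(by_class[k]))
    majority = 1 - minority
    shortfall = len(by_class[majority]) - len(by_class[minority])
    # Draw the extra minority rows with replacement.
    extra = [rng.choice(by_class[minority]) for _ in range(shortfall)]
    new_rows = by_class[majority] + by_class[minority] + extra
    new_labels = ([majority] * len(by_class[majority])
                  + [minority] * (len(by_class[minority]) + shortfall))
    return new_rows, new_labels

# Hypothetical training set: 8 homes without solar, 2 with solar.
rows = [[d] for d in range(10)]  # one made-up feature per home
labels = [0] * 8 + [1] * 2
new_rows, new_labels = oversample_minority(rows, labels)
print(new_labels.count(0), new_labels.count(1))
```

Only the training data would be resampled in this way; the held-out data used to score the classifier keeps its original class mix.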

Discussion
What is the relationship between household wealth and rooftop solar in Australia? Our assessment is that the existing literature has not adequately accounted for the relationship between property ownership, building form, wealth and solar uptake. When breaking the data into "owned", "rented", "other dwellings" and "owner occupied houses" cohorts, we observe that solar uptake is not associated with household wealth. When the modelling is applied to the segmented data, we do not find wealth to be a significant driver of solar uptake for "other dwellings". For the remaining cohorts, there is a statistically significant positive relationship between wealth and solar uptake. However, the model fits the data poorly, fails classification tests and is likely mis-specified. On this basis, we suggest that the model's results cannot be relied upon.
Our own analysis relies on a much more parsimonious model based on data obtained from a sample of customers' bills from which we know or can estimate location (by state), end-use consumption volumes, electricity prices and wealth (proxied by IRSAD decile). This model finds a statistically significant inverse relationship between wealth and solar uptake. The model provides a comparable fit to the data and satisfies in-sample prediction tests. However, our model fails the out-of-sample classification tests, albeit only marginally. Our approach can also be criticized for not actually measuring wealth (wealth is proxied by the IRSAD score of the postcode in which the house is located) and for using group (postcode) level data. While we concede that our "wealth" measure is approximate, such group-level criticism is only likely to undermine the plausibility of the relationship we find if individual household wealth is systematically different to the "wealth" (as measured by IRSAD) of the postcode in which the house is located. We have no reason to suppose that such a systematic difference exists.
Nevertheless, the relatively poor predictive accuracy of our model suggests that a richer dataset is likely to be needed in order to confidently conclude that wealth and solar installation are negatively associated. This would include information on real-world factors that are likely to impact solar installation decisions. Such factors might include cluster/peer effects (see for example Palm, 2017); solar installation customer acquisition strategies that target specific areas or customer cohorts (see for example  of this in the U.S.); or local regulations that may promote or undermine solar installation (such as heritage protections or building codes). More generally, we suggest that research that is able to combine some of the individual, social and information predictors (see Alipour et al., 2020) with consumption, production, irradiance and wealth/income data is likely to be able to provide more confident estimation of the probability that households install rooftop solar. Finally, while public subsidy for rooftop solar is generally now much smaller than in the early years, untangling the effect of past subsidies will be valuable in concluding with greater certainty on the relationship between wealth and solar uptake.

Conclusion and policy implications
The relationship between wealth and solar uptake by Australian households is not well understood.
Our critique of the existing studies suggests that claims that solar uptake is positively related to household wealth are not robust when data are appropriately segmented by ownership and/or building form. Our analysis of bills finds that wealth is negatively related to solar uptake in detached owner-occupied houses. However, we do not consider the modelling results presented here, or in previous studies, to be adequately robust at predicting solar uptake as a function of household wealth.
Confident estimation of the relationship between solar uptake and wealth is valuable in assessing the case for means-tested policy support. This is likely to become even more important if behind-the-meter batteries paired with solar become popular, because greater self-consumption of rooftop solar will have a significant impact on the recovery of shared grid costs. More generally, broadening the research agenda to understand better the range of demand-side and supply-side factors affecting distributed production and storage uptake will be valuable.
Below we provide additional details on the three tree-based models used in the paper. In each case we use features (from either the CHOICE data set or the ABS data set) to predict the value of a response variable (solar uptake) in a process known as supervised learning. More details on all three methods can be found in Hastie et al. (2021) or, for the mathematically inclined, Hastie et al. (n.d.).
The set of possible values for our response variable is $\{0,1\}$, where 1 represents a household with solar and 0 a household without solar. Our features are one of three types: real valued (such as wealth), categorical (such as number of persons in household) and binary (such as whether the household is in Victoria or not). The set of all possible values for the features is known as the feature space ($\mathcal{X}$) and the aim of a supervised learning method is to use the training data to construct a function $\hat{f}: \mathcal{X} \to \{0,1\}$ such that the value $\hat{f}$ takes on an element of the test set matches the real value as often as possible.

Decision Tree
A decision tree is produced by partitioning the feature space into hyper-rectangles (a higher dimensional analogue to the two dimensional rectangle) and then constructing a function $\hat{f}$ that is constant on each hyper-rectangle. The resultant function can be visualized as a binary tree with each leaf representing one of the hyper-rectangles; this aids in interpreting the model and is one reason for its popularity.
The algorithm we have used to produce the decision tree (implemented in the scikit-learn Python package) constructs the tree iteratively. At each stage in the construction of the tree, it examines all possible ways to further partition the feature space and chooses the one with the lowest Gini impurity.
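A single step of this split search can be sketched for one feature as follows. This is a simplified illustration with made-up data, not the scikit-learn implementation; the helper names `gini` and `best_threshold` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - p0^2 - p1^2."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # splitting at the maximum leaves one side empty
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        # Weight each side's impurity by the share of observations it contains.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Made-up example: IRSAD decile of six homes and whether each has solar.
deciles = [1, 2, 3, 7, 8, 9]
has_solar = [1, 1, 1, 0, 0, 0]
print(best_threshold(deciles, has_solar))  # (3, 0.0)
```

A full tree repeats this search over every feature and then recurses into each side of the chosen split until a stopping rule is met.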
Decision trees trained in this way can often be overly complex and over-fit the training data. For this reason we have also used two other tree-based methods to train a classifier.

Random Forest
A random forest consists of many decision trees and is an example of an ensemble learning method.
For an element of the test set, each decision tree $\hat{f}_b$ outputs a predicted response, and the random forest $\hat{f}$ is then defined as the response selected most often by the collection of trees.
To train the decision trees, we use a technique known as bootstrap aggregating. For example, to produce a random forest model consisting of $B$ trees using a training set of size $N$, we produce $B$ synthetic training sets of size $N$ by randomly sampling the original training set with replacement, and then use these synthetic training sets to produce the collection of trees. This typically leads to better model performance on unseen data by decreasing the sensitivity of the model to the training set and thus decreasing the probability of over-fitting the training set.
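Bootstrap aggregating and majority voting can be sketched in a few lines. To keep the example short, each "tree" here is a one-split stump on a single made-up feature; this simplification, the function names and the data are assumptions for illustration and do not reproduce the models fitted in the paper.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-split tree on a single feature: predict `left` if x <= t, else `right`."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Bootstrap aggregating: each tree is trained on N rows drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict_forest(trees, x):
    """Return the response chosen most often by the individual trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Made-up training data: one feature, solar flag as the response.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
forest = fit_forest(xs, ys)
print([predict_forest(forest, x) for x in (1, 10)])
```

Because each stump sees a different resample, individual stumps may place their split differently; the vote averages out that variation.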

Gradient Tree Boosting
Gradient tree boosting is another ensemble learning method built from decision trees. It builds up a model $\hat{f}$ iteratively. At the first step of the iteration, we fit a decision tree $\hat{f}_1$ to the training set. Given the features $x_i$ and response $y_i$ of an element of the training set we can calculate the difference $h_1(x_i) = y_i - \hat{f}_1(x_i)$. The values $h_1(x_i)$ can then be considered new response variables for the training set, and so we can fit a decision tree $\hat{f}_2$ to this adjusted training set. We terminate this process after the $M$th stage to produce our final model $\hat{f}$.
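The stage-wise fitting described above can be sketched as follows. The example assumes squared-error residuals, one-split regression stumps, and a final model formed by summing the stage-wise trees and thresholding at 0.5. These choices, the function names and the data are illustrative assumptions; library implementations typically add a learning rate and use a classification loss.

```python
def fit_regression_stump(xs, targets):
    """One-split regression tree: predicts the mean target on each side of a threshold."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_stages=3):
    """Fit each new tree to the residuals left by the trees fitted so far."""
    trees = []
    residuals = list(ys)  # the first stage fits the raw responses
    for _ in range(n_stages):
        tree = fit_regression_stump(xs, residuals)
        trees.append(tree)
        residuals = [r - tree(x) for x, r in zip(xs, residuals)]
    return trees

def predict(trees, x):
    """Sum the stage-wise trees and threshold at 0.5 to obtain a 0/1 class."""
    score = sum(tree(x) for tree in trees)
    return 1 if score >= 0.5 else 0

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, 1, 0, 1, 0, 0]  # made-up solar flags
trees = boost(xs, ys)
print([predict(trees, x) for x in xs])
```

Each stage corrects part of the error left by the previous stages, which is why the number of stages needs to be capped to limit over-fitting.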
Gradient tree boosting algorithms typically outperform both individual decision trees and random forests in classification problems, but come at the expense of lower interpretability of the model.
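As a rough illustration of how the three tree-based methods can be compared on held-out data, the sketch below fits each with scikit-learn and reports the balanced accuracy score. The synthetic data set, the train/test split and all hyperparameters are assumptions made for the example; they are not the settings or data used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the survey data (about 80% "no solar").
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(criterion="gini", random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

# Train on the training split only and score on the unseen test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = balanced_accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The scores printed for this synthetic data say nothing about the CHOICE or ABS SIH results; the sketch only shows the mechanics of an out-of-sample comparison.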
The ABS Census of Population and Housing ("the census") is conducted every five years and provides another source of information to help explore the relationship between rooftop solar and household socio-economic characteristics including income, ownership and location (i.e. postcode) (Australian Bureau of Statistics, 2018). Based on the information from the census, the ABS also assigns each postcode an Index of Relative Socio-economic Advantage and Disadvantage (IRSAD) decile that ranges from 1 (lowest) to 10 (highest) (Australian Bureau of Statistics, 2016). Data on rooftop solar installations by postcode from 2001 to 2021 are available from the Clean Energy Regulator (2021). Mapping the data on solar installations to occupied private dwellings by postcode using the ABS census data provides another avenue to explore the relationship between solar and wealth.
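The mapping described above amounts to joining the two sources on postcode and then pooling within each IRSAD decile. A minimal sketch follows; the postcodes, counts and deciles are invented for illustration and the variable names are hypothetical.

```python
from collections import defaultdict

# Made-up postcode records standing in for the sources described above.
installations = {"3000": 120, "3001": 450, "4000": 900}   # rooftop solar systems
dwellings = {"3000": 1000, "3001": 1500, "4000": 2000}    # occupied private dwellings
irsad_decile = {"3000": 9, "3001": 4, "4000": 4}

# Join on postcode and pool the counts within each IRSAD decile.
totals = defaultdict(lambda: [0, 0])
for postcode, n_dwellings in dwellings.items():
    decile = irsad_decile[postcode]
    totals[decile][0] += installations.get(postcode, 0)
    totals[decile][1] += n_dwellings

# Share of dwellings with solar in each decile.
uptake_by_decile = {d: solar / homes for d, (solar, homes) in sorted(totals.items())}
print(uptake_by_decile)
```

Pooling counts before dividing weights each postcode by its number of dwellings, rather than treating small and large postcodes equally.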