Begin with the “census2.csv” datafile, which contains census data on various tracts in a district. The fields in the data are:

• Total Population (thousands)
• Professional degree (percent)
• Employed age over 16 (percent)
• Government employed (percent)
• Median home value (dollars)

a) Conduct a principal component analysis using the covariance matrix (the default for prcomp and many routines in other software), and interpret the results. How much of the variance is accounted for by the first component, and why?
b) Try dividing the MedianHomeValue field by 100,000 so that the median home value in the dataset is measured in $100,000s rather than in dollars. How does this change the analysis?
c) Compute the PCA with the correlation matrix instead. How does this change the result, and how does it compare with your answer in b), if you completed that part?
d) Analyze the correlation matrix for this dataset for significance, and look for variables that are extremely correlated or uncorrelated. Discuss the effect of this on the analysis.
e) Discuss what using the correlation matrix does and why it may or may not be appropriate in this case.
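The contrast at the heart of parts a)–c) — that covariance-based PCA lets a dollar-scaled variable swamp the first component, while correlation-based PCA standardizes every variable to unit variance — can be sketched numerically. The data below are synthetic, not the census2.csv file; the variable names and ranges are illustrative assumptions only.

```python
import math
import random
import statistics

# Hypothetical data on two very different scales, echoing the exercise:
# a percentage field and a home value measured in dollars.
random.seed(0)
pct = [random.uniform(40, 80) for _ in range(200)]            # e.g. percent employed
home = [random.uniform(50_000, 400_000) for _ in range(200)]  # home value in dollars

def cov(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pc1_share(a, b, c):
    """Share of total variance on the first principal component of the
    2x2 symmetric matrix [[a, c], [c, b]], via the quadratic formula
    for its eigenvalues."""
    disc = math.sqrt(((a - b) / 2) ** 2 + c ** 2)
    lam1 = (a + b) / 2 + disc
    return lam1 / (a + b)

# Covariance-matrix PCA: the dollar-scaled variable dominates PC1,
# because its variance is larger by many orders of magnitude.
share_cov = pc1_share(cov(pct, pct), cov(home, home), cov(pct, home))

# Correlation-matrix PCA: both variables are rescaled to unit variance,
# so PC1's share depends only on the correlation between them.
r = cov(pct, home) / math.sqrt(cov(pct, pct) * cov(home, home))
share_cor = pc1_share(1.0, 1.0, r)

print(f"PC1 variance share, covariance matrix:  {share_cov:.4f}")  # near 1.0
print(f"PC1 variance share, correlation matrix: {share_cor:.4f}")  # near 0.5
```

Dividing `home` by 100,000, as in part b), shrinks its variance by a factor of 10^10 and completely changes the covariance-based answer, while the correlation-based answer is unaffected, since correlations are invariant to rescaling.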