Please complete this problem set in R and a word processing program, including graphics and tables where appropriate. The problem set is due by email ([email protected]) at midnight on March 30, 2015. Please attach both the problem set answers and the R code file to the email.

I have provided tips for the necessary R code where appropriate, and additionally remind you that everything in this problem set appears in the code files for the lectures. Feel free to use optional arguments to the functions to do things like customize graphics, if you like. Some arguments will be necessary, e.g. (, na.rm=T) where R cannot calculate a statistic because of missing data (NAs).

Please feel free to ask for feedback as you make progress on the homework. My goal is for everyone to receive full credit and understand the material.

1 Preamble and Data

First, download the data [country2008] from D2L, and save it in the folder you will designate as your working directory. Second, begin your R code, starting with the preamble, and then setting your working directory and loading the data:

    ##
    ##
    ##
    ##
    ## Georgia State University
    ## POLS 3800-, Introduction to Research Methods
    ## Problem Set 1

    ## Load libraries
    library()

    ## Set working directory
    setwd("")

    ## Load data
    dat <- dget("country2008")

2 Summary Statistics

2.1 Overview

Examine a broad summary of your data [dim(), summary()].

1. How many observations are in the dataset? What are the units of observation? Is the data cross-sectional, time-series, or both (TSCS)? How many variables are there?

2. What sorts of summary statistics does summary() provide? Why do some variables have different statistics from others?

2.2 Measures of Central Tendency and Variation

1. How many observations are there for European countries [dat$europe]? North and South American [dat$americas]? Asian [dat$asia]? African [dat$africa]? Present your answer as a one-way frequency table; use either table() for each variable, or summary() to find the sums.
What kind of variable (in terms of level of measurement) is each regional indicator? If one drew an observation at random from the data, from what region is it most likely to come? What measure of central tendency did you use to determine this?

2. What is the mean [mean()] population [dat$pop, in 1,000s] of the countries in the data? The median [median()]? Are those two answers the same for this data, or different? Why? Note that you may need to include the argument na.rm=T if the variable has missing data; otherwise the function will return "NA": e.g., mean(, na.rm=T). This applies to almost all the summary statistic functions.

What happens if one takes the trimmed mean of the population variable, dropping the two smallest and two largest observations? Use:

    mean(sort()[-c(1, 2, length(sort())-1, length(sort()))], na.rm=T)

which sorts the variable from smallest to largest, excludes the first two and last two observations, and takes the mean while ignoring missing data.

What happens if you take the logarithm of the population variable [mean(log(dat$pop))]? Are the mean and median closer to each other? Plot and include in your answer two histograms¹ of the variable, with and without a log-transformation:

    png("pophist1.png", width=600, height=480)  # saves a graphic to the working directory
    hist()
    dev.off()  # closes the graphical device

    png("pophist2.png", width=600, height=480)
    hist(log())
    dev.off()

¹ Tip: "portable network graphics," i.e. .png files, are useful formats for images because such files can be scaled in a document without suffering as much loss of resolution.

What is the interquartile range of the population variable [IQR()]? Create, include, and interpret a box-whisker plot in your answer for both the untransformed and the log-transformed variable:

    png("popboxwhisk1.png", width=480, height=600)
    boxplot()
    dev.off()

    png("popboxwhisk2.png", width=480, height=600)
    boxplot(log())
    dev.off()

Are there outliers in either plot? How is an outlier defined in a box-whisker plot?

3. What is the variance [var()] of the per capita income variables for men and women [dat$income.m and dat$income.f]? The standard deviation [sd()]? The standard error? (For this last statistic you will need to divide the standard deviation by the square root of the number of observations [length(which(!is.na()))].) The data here is a sample. What is the 95% confidence interval for the sample mean [c(mean()-qnorm(0.975)*, mean()+qnorm(0.975)*)]? The 90% confidence interval?

3 Simple Hypothesis Tests

1. Examine male per capita income [dat$income.m]. Can we reject the null hypothesis (given a significance level α = 0.05) that the population mean for male per capita income is $12,000? $11,500? $11,000? Use the t.test() function to answer this question. What if our significance level is 0.01?

2. What is the average (mean) mean age of marriage for men in the data [dat$mean.marr.m]? For women [dat$mean.marr.f]? Do a difference-in-means hypothesis test to determine the probability of observing the sample difference between mean marriage ages for each sex given a null hypothesis that the mean marriage ages are the same (h0: µ_m = µ_f). What is the alternative hypothesis (hA) being tested here? Is the test one-tailed or two-tailed?

3. Define a p-value.

4 Measures of Correlation

4.1 Categorical Variables

Download from D2L and load the Titanic data set:²

    dat2 <- dget("titanic")

1. Create a two-way frequency table [table()] with the sex [dat2$Sex] and survival [dat2$Survived] variables; include the table in your answer to the problem set. Can we reject the null hypothesis that sex and survival were not associated (h0: π_M = π_F)? How can you tell? Use both the chisq.test() and prop.test() functions. Have we proven that sex and survival were associated?

2. Create a two-way frequency table with the class [dat2$Class] and survival [dat2$Survived] variables; include the table in your answer to the problem set.
If our alternative hypothesis is that class affects the probability of survival, what is the null hypothesis? What is the test statistic here? How can we interpret the p-value?

4.2 Continuous Variables

1. Create a scatterplot [plot(,)] with the country variable for female life expectancy [dat$lifexp.f] on the x-axis and the variable for mean marriage age for women [dat$mean.marr.f] on the y-axis; include the plot in your answer. Does it appear that the two variables are related in some way? If so, describe the apparent relationship. What is the correlation coefficient (r) [cor(,, use="pairwise")] for these two variables? Interpret the coefficient.

2. Create a scatterplot with a smoothed line fitting the data [scatter.smooth(,)] for female literacy and the proportion of the labor force that is female [dat$laborforce.f]. What is the correlation coefficient for these variables? Is it a useful statistic here? What does it mean that the relationship between female literacy and the proportion of the labor force that is female is somewhat nonlinear?

² R requires straight quotation marks in code; if you copy and paste code from this document, you may need to replace any curly quotation marks in your code file.
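As a supplementary illustration of the use="pairwise" argument mentioned in Section 4.2, the sketch below shows how cor() behaves when either variable has missing values. It runs on simulated data only: the vectors x and y and their values are made up for this example and are not taken from country2008, so treat it as a rough guide rather than a template for your answer.

```r
# Simulated data (hypothetical; NOT from the country2008 file).
set.seed(1)
x <- c(rnorm(50, mean = 70, sd = 8), NA, NA)  # 52 values, the last two missing
y <- 0.2 * x + rnorm(52, sd = 2)              # NA in x carries over to y
y[5] <- NA                                    # one additional missing value in y

# With the default settings, any missing value makes the result NA.
r_default <- cor(x, y)

# use = "pairwise" (short for "pairwise.complete.obs") keeps only the
# observations where both variables are observed.
r_pairwise <- cor(x, y, use = "pairwise")

# The same coefficient computed by hand on the complete pairs.
ok <- complete.cases(x, y)
r_manual <- cor(x[ok], y[ok])

print(c(default = r_default, pairwise = r_pairwise, manual = r_manual))
print(sum(ok))  # number of complete pairs used
```

With only two variables, the pairwise and complete-case results coincide; the two approaches can differ when cor() is given a matrix of three or more variables.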