Data Analysis – Week 2

一、Steps in a Data Analysis

Define the question
Define the ideal data set
Determine what data you can access
Obtain the data
Clean the data
Exploratory data analysis
Statistical prediction/modeling
Interpret results
Challenge results
Synthesize/write up results
Create reproducible code

二、Data Analysis Files

（一）Data

　　1、Raw data
　　　　-Should be stored in your analysis folder
　　　　-If accessed from the web, include the URL, a description, and the date accessed in the README
　　2、Processed data
　　　　-Processed data should be named so it is easy to see which script generated the data
　　　　-The mapping from each processing script to the processed data it generates should be recorded in the README
　　　　-Processed data should be tidy

（二）Figures

　　1、Exploratory figures
　　　　-Figures made during the course of your analysis, not necessarily part of your final report
　　　　-They do not need to be “pretty”
　　2、Final figures
　　　　-Usually a small subset of the original figures
　　　　-Axes/colors set to make the figure clear
　　　　-Possibly multiple panels

（三）R code

　　1、Raw scripts
　　　　-May be less commented (but comments help you!)
　　　　-May be multiple versions
　　　　-May include analyses that are later discarded
　　2、Final scripts
　　　　-Clearly commented
　　　　-Small comments used liberally – what, when, why, how
　　　　-Bigger commented blocks for whole sections
　　　　-Include processing details
　　　　-Only analyses that appear in the final write-up
　　3、R Markdown files (optional)
　　　　-R Markdown files can be used to generate reproducible reports
　　　　-Text and R code are integrated
　　　　-Very easy to create in RStudio

（四）Text

　　1、Readme files
　　　　-Not necessary if you use R Markdown
　　　　-Should contain step-by-step instructions for analysis
　　2、Text of analysis
　　　　-It should include a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)
　　　　-It should tell a story
　　　　-It should not include every analysis you performed
　　　　-References should be included for statistical methods

三、Getting Data

getwd() – get the working directory.
setwd("C:\\path\\to\\directory") – set the working directory.
download.file(fileUrl, destfile="C:\\path\\filename.csv") – get data from the internet.
list.files("C:\\path") – list all file names in "C:\\path".
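
The functions above can be combined into a small "obtain the data" step. The following is only a sketch: dataDir and destFile are hypothetical names, and a locally written CSV stands in for the real download so that the example runs without a network connection.

```r
# Sketch only: dataDir and destFile are hypothetical names, and a small
# locally written CSV stands in for the file download.file() would fetch.
dataDir <- file.path(tempdir(), "data")
if (!file.exists(dataDir)) {
  dir.create(dataDir)
}

destFile <- file.path(dataDir, "example.csv")
# With a real URL this step would be:
#   download.file(fileUrl, destfile = destFile)
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
          destFile, row.names = FALSE)

# Record when the data were obtained, so the date can go into the README.
dateDownloaded <- date()

# List what is now in the data directory.
files <- list.files(dataDir)
print(files)
```

Using file.path() instead of hand-written "C:\\..." strings keeps the script portable across operating systems.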

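readData <- read.table(filename, sep="", header=TRUE) – read a delimited text file into a data frame.
head(readData, rows) – get the column names and the first rows rows of data.
readData <- read.csv(file.choose()) – open a file dialog to choose the file; less reproducible, but useful.
Related functions: read.csv(), read.xlsx(), read.xlsx2() (the last two come from the xlsx package).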
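
As a rough illustration of read.table() and head(), the sketch below writes a tiny CSV to a temporary file and reads it back. The file contents and the name csvPath are made up for the example.

```r
# Hypothetical example data written to a temporary file.
csvPath <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,1", "b,2", "c,3"), csvPath)

# sep = "," because the file is comma separated; header = TRUE uses
# the first line as column names.
readData <- read.table(csvPath, sep = ",", header = TRUE)

# Column names plus the first 2 rows.
firstRows <- head(readData, 2)
print(names(readData))
print(firstRows)

# read.csv() is a wrapper around read.table() whose defaults already
# include sep = "," and header = TRUE.
sameData <- read.csv(csvPath)
```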

con <- file("C:\\path\\to\\file", "r")

readData <- read.csv(con)

close(con)

con <- url("http://vartang.com", "r")

varData <- readLines(con)

close(con)
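
A connection keeps its position between reads, which is what makes it useful for reading a file in pieces. The sketch below assumes a small temporary text file (txtPath is a hypothetical name) rather than a real path or URL.

```r
# Hypothetical three-line text file in a temporary location.
txtPath <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), txtPath)

# Open the connection for reading.
con <- file(txtPath, "r")

# Read two lines, then whatever is left; the second call continues
# where the first one stopped.
firstTwo <- readLines(con, n = 2)
rest <- readLines(con)

# Always close a connection you opened.
close(con)

print(firstTwo)
print(rest)
```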

# Delete rows or columns from a data.frame:

con <- read.csv("C:\\somefile.csv")

con2 <- con[,-1:-3] # drop the first through third columns
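
Negative indices work the same way for rows and columns. A minimal sketch on a made-up data frame (df and its columns are hypothetical):

```r
# Hypothetical data frame with five columns.
df <- data.frame(a = 1:4,
                 b = letters[1:4],
                 c = c(TRUE, FALSE, TRUE, FALSE),
                 d = 4:1,
                 e = c(0.1, 0.2, 0.3, 0.4))

# Drop columns 1 to 3; -(1:3) and -1:-3 are the same index vector.
dropCols <- df[, -(1:3)]

# Drop the second row.
dropRow <- df[-2, ]

# When only one column would remain, drop = FALSE keeps the result
# a data frame instead of collapsing it to a vector.
oneLeft <- df[, -(1:4), drop = FALSE]

print(names(dropCols))
print(nrow(dropRow))
```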

write.table() – The opposite of read.table().
save(), save.image() – Save R objects (*.rda).
load() – Opposite of save().
ls() - Return a vector of character strings giving the names of the objects in the specified environment.
rm(list=ls()) – Remove everything from the workspace.
paste(..., sep = " ", collapse = NULL), paste0() – Pasting character strings together.
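
The sketch below exercises the writing, saving, and pasting functions listed above. The object scores and the file names are hypothetical, and everything is written to tempdir() so nothing in the working directory is touched.

```r
# Hypothetical data to write out.
scores <- data.frame(id = 1:3, score = c(90, 85, 77))

# write.table() and read.table() as a round trip through a text file.
outPath <- file.path(tempdir(), "scores.txt")
write.table(scores, outPath, sep = "\t", row.names = FALSE)
roundTrip <- read.table(outPath, sep = "\t", header = TRUE)

# save() stores the R object itself; load() restores it under its
# original name.
rdaPath <- file.path(tempdir(), "scores.rda")
save(scores, file = rdaPath)
rm(scores)
load(rdaPath)

# paste() joins element-wise with sep, or into one string with collapse.
labels <- paste("file", 1:3, sep = "_")          # "file_1" "file_2" "file_3"
joined <- paste(c("a", "b", "c"), collapse = "+") # "a+b+c"
short  <- paste0("x", 1:2)                       # "x1" "x2"
```

Note that rm(list=ls()) would also remove outPath and rdaPath, so it is best kept for the very start of a script.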

library(XML)

html3 <- htmlTreeParse("http-url", useInternalNodes=TRUE)

xpathSApply(html3, "//td[@id='col-citedby']", xmlValue)

Packages:
httr – for working with http connections.
RMySQL – for interfacing with MySQL.
bigmemory – for handling data larger than RAM.
RHadoop – for interfacing R and Hadoop (by Revolution Analytics).
foreign – for getting data into R from SAS, SPSS, Octave, etc.

四、Summarizing Data

dim(), names(), nrow(), ncol()
sapply(eData[1,], class) - apply the function class to each column of the first row of eData, giving the class of every column.
unique() – remove all the duplicate elements/rows.
length() – get or set the length.
table() – uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels; if useNA="ifany", NA values are counted when present.
eData[eData$Lat > 0 & eData$Lon > 0,c("Lat","Lon")] – return the rows of eData where both Lat > 0 and Lon > 0, keeping only the columns "Lat" and "Lon".
eData[eData$Lat > 0 | eData$Lon > 0,c("Lat","Lon")] – return the rows of eData where either Lat > 0 or Lon > 0, keeping only the columns "Lat" and "Lon".
is.na() – indicate which elements are missing (NA).
rowSums(), rowMeans(), colSums(), colMeans() – sums and means over rows or columns; if na.rm=TRUE, NA values are removed before computing.
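
To make the summaries above concrete, here is a sketch on a tiny stand-in for eData. The values and the Mag column are invented for illustration; the real data set is not reproduced here.

```r
# Hypothetical stand-in for eData.
eData <- data.frame(Lat = c(10, -5, 20, -8),
                    Lon = c(30, 40, -10, -15),
                    Mag = c(1.2, NA, 2.4, 3.0))

dims <- dim(eData)                        # rows, then columns
colClasses <- sapply(eData[1, ], class)   # class of every column

# useNA = "ifany" adds a count for NA only when NA values are present.
srcCounts <- table(c("ak", "ci", NA, "ak"), useNA = "ifany")

# Logical subsetting: rows first, then the columns to keep.
both   <- eData[eData$Lat > 0 & eData$Lon > 0, c("Lat", "Lon")]
either <- eData[eData$Lat > 0 | eData$Lon > 0, c("Lat", "Lon")]

# Count missing values, then average each column ignoring them.
nMissing <- sum(is.na(eData$Mag))
means <- colMeans(eData, na.rm = TRUE)

print(dims)
print(srcCounts)
print(means)
```

One caveat worth knowing: if the columns used in the condition contain NA, the logical index contains NA as well and the result gains all-NA rows, so it helps to check is.na() first.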

五、Data Munging Basics

tolower(), toupper() - Translate characters in character vectors from upper to lower case or vice versa.
strsplit(x, split) - Split the elements of a character vector x into substrings according to the matches to substring split within them.
sub(pattern, replacement, x), gsub(pattern, replacement, x) - Perform replacement of the first and all matches respectively.
cut() - Divides the range of x into intervals and codes the values in x according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.
merge() - Merge two data frames by common columns or row names, or do other versions of database join operations.
sort(x, decreasing = FALSE, ...) - Sort (or order) a vector or factor (partially) into ascending or descending order. For ordering along more than one variable, e.g., for sorting data frames, see order().
order(..., na.last = TRUE, decreasing = FALSE) - Returns a permutation which rearranges its first argument into ascending or descending order.
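
The munging functions above can be tried on small made-up inputs. In this sketch the strings and the data frames left and right are hypothetical.

```r
# Case conversion and splitting; the dot must be escaped because
# split is treated as a regular expression.
lowerName <- tolower("Lat.Deg")                 # "lat.deg"
parts <- strsplit("Lat.Deg", "\\.")[[1]]        # "Lat" "Deg"

# sub() replaces the first match, gsub() replaces all of them.
firstOnly <- sub("_", "", "a_b_c")              # "ab_c"
allGone   <- gsub("_", "", "a_b_c")             # "abc"

# cut() turns numbers into interval factors.
bins <- cut(c(1, 5, 9), breaks = c(0, 3, 6, 10))

# merge() joins on the shared column; all = TRUE keeps unmatched rows.
left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))
inner <- merge(left, right, by = "id")
outer <- merge(left, right, by = "id", all = TRUE)

# sort() returns the sorted values; order() returns the positions
# that would sort them, which is what data frame sorting needs.
sorted <- sort(c(3, 1, 2))                      # 1 2 3
perm   <- order(c(3, 1, 2))                     # 2 3 1
leftDesc <- left[order(left$id, decreasing = TRUE), ]
```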
