一、Steps in a Data Analysis
- Define the question
- Define the ideal data set
- Determine what data you can access
- Obtain the data
- Clean the data
- Exploratory data analysis
- Statistical prediction/modeling
- Interpret results
- Challenge results
- Synthesize/write up results
- Create reproducible code
二、Data Analysis Files
(一)Data
1、Raw data
-Should be stored in your analysis folder
-If accessed from the web, include url, description, and date accessed in README
2、Processed data
-Processed data should be named so it is easy to see which script generated the data
-The mapping from each processing script to the processed data it generates should be recorded in the README
-Processed data should be tidy
(二)Figures
1、Exploratory figures
-Figures made during the course of your analysis, not necessarily part of your final report
-They do not need to be “pretty”
2、Final figures
-Usually a small subset of the original figures
-Axes/colors set to make the figure clear
-Possibly multiple panels
(三)R code
1、Raw scripts
-May be less commented (but comments help you!)
-May exist in multiple versions
-May include analyses that are later discarded
2、Final scripts
-Clearly commented
-Small comments liberally – what, when, why, how
-Bigger commented blocks for whole sections
-Include processing details
-Only analyses that appear in the final write-up
3、R Markdown files (optional)
-R Markdown files can be used to generate reproducible reports
-Text and R code are integrated
-Very easy to create in RStudio
(四)Text
1、Readme files
-Not necessary if you use R Markdown
-Should contain step-by-step instructions for analysis
2、Text of analysis
-It should include a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)
-It should tell a story
-It should not include every analysis you performed
-References should be included for statistical methods
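The file categories above can be turned into a folder skeleton. The following is only a sketch: the folder names and the README wording are assumptions chosen for illustration, not a layout the notes prescribe, and it builds everything under a temporary directory so it is safe to run.

```r
# Hypothetical folder names; the notes only describe the categories.
project_dir <- file.path(tempdir(), "my_analysis")
sub_dirs <- c(
  "data/raw", "data/processed",
  "figures/exploratory", "figures/final",
  "code/raw", "code/final",
  "text"
)
for (d in sub_dirs) {
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# Record where the raw data came from and when it was accessed.
readme <- file.path(project_dir, "text", "README.txt")
writeLines(
  c("Raw data source: <url goes here>",
    paste("Date accessed:", format(Sys.Date()))),
  readme
)

# List what was created, relative to the project root.
created <- list.dirs(project_dir, recursive = TRUE, full.names = FALSE)
```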
三、Getting Data
- getwd() – get the working directory.
- setwd("C:\\path\\to\\directory") – set the working directory.
- download.file(fileUrl, destfile="C:\\path\\filename.csv") – get data from the internet.
- list.files("C:\\path") – list the names of all files in "C:\\path".
- readData <- read.table(filename, sep="", header=TRUE) – read a delimited text file into a data frame.
- head(readData, rows) – show the column names and the first rows rows of data.
- readData <- read.csv(file.choose()) – open a file dialog to choose the file; less reproducible, but useful.
- Related functions: read.csv(), and read.xlsx() and read.xlsx2() from the xlsx package.
    con <- file("C:\\path\\to\\file", "r")
    readData <- read.csv(con)
    close(con)
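To try the reading functions without a real file on disk, one option is to write a tiny CSV to a temporary file first. This is a minimal sketch; the column names and values are made up for illustration.

```r
# Write a small CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,a,3.5", "2,b,4.0", "3,c,2.5"), csv_path)

# read.table() needs sep and header spelled out for CSV input.
fromTable <- read.table(csv_path, sep = ",", header = TRUE)

# read.csv() is read.table() with CSV-friendly defaults.
fromCsv <- read.csv(csv_path)

# Reading through an explicit connection; close it when done.
con <- file(csv_path, "r")
fromCon <- read.csv(con)
close(con)

# Column names plus the first two rows.
firstTwo <- head(fromCsv, 2)
```

All three calls should give the same data frame here; the connection form mainly matters when the same connection is read in several steps.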
    con <- url("http://vartang.com", "r")
    varData <- readLines(con)
    close(con)
    # Delete rows or columns of a data.frame:
    con <- read.csv("C:\\somefile.csv")
    con2 <- con[, -(1:3)]  # drop the first through third columns
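Negative indices work for rows as well as columns. The sketch below uses a small made-up data frame (`df` and its columns are hypothetical) to show both, plus dropping a column by name.

```r
df <- data.frame(a = 1:4, b = letters[1:4], c = c(2.5, 3.5, 4.5, 5.5), d = 4:1)

# Drop columns 1 to 3. drop = FALSE keeps the result a data.frame
# even when only one column is left.
noFirstThree <- df[, -(1:3), drop = FALSE]

# Drop the second row.
noRowTwo <- df[-2, ]

# Drop a column by name.
noB <- df[, names(df) != "b"]
```

Without `drop = FALSE`, a selection that leaves a single column is simplified to a plain vector, which is easy to miss.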
- write.table() – The opposite of read.table().
- save(), save.image() – Save R objects to a file (*.rda, *.RData).
- load() – Opposite of save().
- ls() - Return a vector of character strings giving the names of the objects in the specified environment.
- rm(list=ls()) – Remove everything from the workspace.
- paste(..., sep = " ", collapse = NULL), paste0() – Pasting character strings together.
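The functions in this list can be exercised as round trips. This is a sketch under simple assumptions: the `scores` data frame is made up, and everything is written to temporary files.

```r
scores <- data.frame(id = 1:3, score = c(3.5, 4.0, 2.5))

# write.table() then read.table() round-trips a data frame through text.
txt_path <- tempfile(fileext = ".txt")
write.table(scores, txt_path, sep = "\t", row.names = FALSE)
scoresBack <- read.table(txt_path, sep = "\t", header = TRUE)

# save() stores named objects in binary form; load() restores them by name.
rda_path <- tempfile(fileext = ".rda")
save(scores, file = rda_path)
rm(scores)
loaded <- load(rda_path)  # returns the names of the restored objects

# paste() joins with a separator; collapse merges a vector into one string.
label <- paste("file", 1:3, sep = "_")            # "file_1" "file_2" "file_3"
joined <- paste0("a", "b", "c")                   # "abc"
collapsed <- paste(c("x", "y"), collapse = "+")   # "x+y"
```

Note that load() recreates objects under their original names, so it can silently overwrite an object of the same name in the workspace.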
    library(XML)
    html3 <- htmlTreeParse("http-url", useInternalNodes = TRUE)
    xpathSApply(html3, "//td[@id='col-citedby']", xmlValue)
Packages:
httr – for working with http connections.
RMySQL – for interfacing with MySQL.
bigmemory – for handling data larger than RAM.
RHadoop – for interfacing R and Hadoop (by Revolution Analytics).
foreign – for getting data into R from SAS, SPSS, Octave, etc.
四、Summarizing Data
- dim(), names(), nrow(), ncol()
- sapply(eData[1,], class) – apply the function class() to each column of the first row of eData, giving the class of every column.
- unique() – remove all the duplicate elements/rows.
- length() – get or set the length.
- table() – uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels; with useNA="ifany", NA values are counted too.
- eData[eData$Lat > 0 & eData$Lon > 0, c("Lat","Lon")] – return the "Lat" and "Lon" columns for the rows of eData where both Lat > 0 and Lon > 0.
- eData[eData$Lat > 0 | eData$Lon > 0, c("Lat","Lon")] – return the "Lat" and "Lon" columns for the rows of eData where either Lat > 0 or Lon > 0.
- is.na() – test each element for a missing value (NA).
- rowSums(), rowMeans(), colSums(), colMeans() – summarize rows/columns; with na.rm=TRUE, NA values are removed first.
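The summaries above can be tried on a small stand-in for eData. The data frame below is hypothetical (the real eData is not shown in these notes); only the Lat and Lon column names are taken from the examples above.

```r
# Hypothetical stand-in for eData.
eData <- data.frame(
  Lat = c(10, -5, 20, -8),
  Lon = c(30, 40, -15, -25),
  Mag = c(1.2, NA, 2.4, 3.0),
  Src = c("ak", "ci", "ak", NA)
)

shape <- dim(eData)                       # 4 rows, 4 columns
colClasses <- sapply(eData[1, ], class)   # class of every column
srcCounts <- table(eData$Src, useNA = "ifany")

# Rows where both conditions hold, then rows where at least one holds.
both <- eData[eData$Lat > 0 & eData$Lon > 0, c("Lat", "Lon")]
either <- eData[eData$Lat > 0 | eData$Lon > 0, c("Lat", "Lon")]

# Count missing values per column, then average while ignoring them.
naPerCol <- colSums(is.na(eData))
colMeansNum <- colMeans(eData[, c("Lat", "Mag")], na.rm = TRUE)
```

Combining is.na() with colSums() is a quick way to see which columns contain missing values before deciding whether na.rm=TRUE is appropriate.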
五、Data Munging Basics
- tolower(), toupper() - Translate characters in character vectors from upper to lower case or vice versa.
- strsplit(x, split) - Split the elements of a character vector x into substrings according to the matches to substring split within them.
- sub(pattern, replacement, x), gsub(pattern, replacement, x) - Perform replacement of the first and all matches respectively.
- cut() - Divides the range of x into intervals and codes the values in x according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.
- merge() - Merge two data frames by common columns or row names, or do other versions of database join operations.
- sort(x, decreasing = FALSE, ...) - Sort (or order) a vector or factor (partially) into ascending or descending order. For ordering along more than one variable, e.g., for sorting data frames, see order().
- order(..., na.last = TRUE, decreasing = FALSE) - Returns a permutation which rearranges its first argument into ascending or descending order.
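A short sketch of the munging functions listed above, using made-up vectors and data frames (all names here are hypothetical):

```r
cities <- c("New_York", "san_francisco")
lower <- tolower(cities)               # "new_york" "san_francisco"
parts <- strsplit(cities, "_")         # list with one character vector per element
firstOnly <- sub("_", " ", "a_b_c")    # "a b_c": only the first match is replaced
allOf <- gsub("_", " ", "a_b_c")       # "a b c": every match is replaced

# cut() codes values by interval; intervals are right-closed by default.
ages <- c(5, 17, 42, 70)
ageGroup <- cut(ages, breaks = c(0, 18, 65, 100))

# merge() joins on the common column "id".
people <- data.frame(id = c(3, 1, 2), name = c("c", "a", "b"))
scores <- data.frame(id = c(1, 2, 4), score = c(90, 80, 70))
joined <- merge(people, scores, by = "id")                # inner join
everyone <- merge(people, scores, by = "id", all = TRUE)  # full outer join

# sort() reorders a vector; order() gives the index used to reorder rows.
sorted <- sort(c(3, 1, 2), decreasing = TRUE)             # 3 2 1
byId <- people[order(people$id), ]
```

sort() returns the sorted values themselves, while order() returns positions, which is why order() is the one to use for sorting a data frame by a column.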