Data Analysis - Week 2

一、Steps in a Data Analysis
  1. Define the question
  2. Define the ideal data set
  3. Determine what data you can access
  4. Obtain the data
  5. Clean the data
  6. Exploratory data analysis
  7. Statistical prediction/modeling
  8. Interpret results
  9. Challenge results
  10. Synthesize/write up results
  11. Create reproducible code
二、Data Analysis Files
(一)Data

  1、Raw data
    -Should be stored in your analysis folder
    -If accessed from the web, include the URL, a description, and the date accessed in the README
  2、Processed data
    -Processed data should be named so it is easy to see which script generated the data
    -The mapping from each processing script to the processed data it generates should be recorded in the README
    -Processed data should be tidy
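
As a minimal sketch of the conventions above, the following R code records a raw-data source in a README and names a processed file after the script that produced it. The folder layout, the script name 01_clean.R, the file names, and the URL are all hypothetical placeholders, and a temporary directory stands in for the analysis folder.

```r
# Hypothetical project layout inside a temporary folder (stand-in for your analysis folder)
proj <- file.path(tempdir(), "analysis")
dir.create(file.path(proj, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(proj, "data", "processed"), showWarnings = FALSE)

# Stand-in raw data; in practice this would be the file you obtained
raw <- data.frame(ID = c(1, 2, 2, 3), Value = c("10", "20", "20", NA),
                  stringsAsFactors = FALSE)
write.csv(raw, file.path(proj, "data", "raw", "survey.csv"), row.names = FALSE)

# Processing step, imagined to live in a script called 01_clean.R (hypothetical name)
processed <- unique(raw)                         # drop duplicate rows
processed$Value <- as.numeric(processed$Value)   # store numbers as numbers
names(processed) <- tolower(names(processed))    # consistent column names

# Name the output after the script that generated it
outFile <- file.path(proj, "data", "processed", "01_clean_survey.csv")
write.csv(processed, outFile, row.names = FALSE)

# README entries: source URL (placeholder), description, date accessed,
# and the script-to-file mapping
readme <- c(
  "Raw data: data/raw/survey.csv",
  "  url: https://example.com/survey.csv (placeholder)",
  "  description: example survey responses",
  paste("  date accessed:", format(Sys.Date())),
  "Processed data: data/processed/01_clean_survey.csv <- 01_clean.R"
)
writeLines(readme, file.path(proj, "README.txt"))
```

Prefixing the processed file with the script name is only one possible convention; any scheme works as long as the README states the mapping.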

(二)Figures

  1、Exploratory figures
    -Figures made during the course of your analysis, not necessarily part of your final report
    -They do not need to be “pretty”
  2、Final figures
    -Usually a small subset of the original figures
    -Axes/colors set to make the figure clear
    -Possibly multiple panels

(三)R code

  1、Raw scripts
    -May be less commented (but comments help you!)
    -May be multiple versions
    -May include analyses that are later discarded
  2、Final scripts
    -Clearly commented
    -Small comments used liberally – what, when, why, how
    -Bigger commented blocks for whole sections
    -Include processing details
    -Only analyses that appear in the final write-up
  3、R Markdown files (optional)
    -R Markdown files can be used to generate reproducible reports
    -Text and R code are integrated
    -Very easy to create in RStudio

(四)Text

  1、Readme files
    -Not necessary if you use R Markdown
    -Should contain step-by-step instructions for analysis
  2、Text of analysis
    -It should include a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)
    -It should tell a story
    -It should not include every analysis you performed
    -References should be included for statistical methods

三、Getting Data
  • getwd() – get the working directory.
  • setwd("C:\\path\\to\\directory") – set the working directory.
  • download.file(fileUrl, destfile="C:\\path\\filename.csv") – get data from the internet.
  • list.files("C:\\path") – list the names of all files in "C:\\path".
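
A minimal sketch of the download workflow above. To keep it runnable without a network connection, a small local CSV referenced through a file:// URL stands in for a remote source; the file names, the data folder name, and the data values are made up for illustration.

```r
# Stand-in for a remote file: a small CSV in a temporary folder.
# With real data, fileUrl would be an http(s) address instead.
src <- file.path(tempdir(), "source.csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)), src, row.names = FALSE)
fileUrl <- paste0("file://", normalizePath(src, winslash = "/"))

# Create a data folder (hypothetical name) if needed, then download into it
dataDir <- file.path(tempdir(), "data")
if (!dir.exists(dataDir)) dir.create(dataDir)
destfile <- file.path(dataDir, "source_copy.csv")
download.file(fileUrl, destfile = destfile)

# Record when the data were obtained, for the README
dateDownloaded <- date()

# Confirm the file arrived
list.files(dataDir)
```

Building paths with file.path() instead of hard-coded "C:\\..." strings is a choice made here so the same code runs on any operating system.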

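  • readData <- read.table(filename, sep="", header=TRUE) – read a delimited text file into a data frame.
  • head(readData, rows) – show the column names and the first few rows of data (rows sets how many).
  • readData <- read.csv(file.choose()) – open a file dialog to choose the file; less reproducible, but useful.
  • Related functions: read.csv(), read.xlsx(), read.xlsx2().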
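
The reading functions can be tried on a small file written on the spot. This is a sketch only: the file names and the cameras data frame are invented for illustration, and a tab separator is used so that values containing spaces stay in one column.

```r
# Write a small tab-separated file to read back (stand-in for a real data file)
filename <- file.path(tempdir(), "cameras.txt")
cameras <- data.frame(street = c("Main St", "Oak Ave", "Pine Rd"),
                      speed  = c(30, 45, 25))
write.table(cameras, filename, sep = "\t", row.names = FALSE, quote = FALSE)

# read.table(): state the separator and whether the first line holds column names
readData <- read.table(filename, sep = "\t", header = TRUE)
head(readData, 2)     # column names plus the first 2 rows

# read.csv() is read.table() with sep = "," and header = TRUE as defaults
csvFile <- file.path(tempdir(), "cameras.csv")
write.csv(cameras, csvFile, row.names = FALSE)
csvData <- read.csv(csvFile)

# file.choose() opens a dialog, so it only works in an interactive session:
# chosen <- read.csv(file.choose())
```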

  • write.table() – The opposite of read.table().
  • save(), save.image() – Save R objects (*.rda).
  • load() – Opposite of save().
  • ls() - Return a vector of character strings giving the names of the objects in the specified environment.
  • rm(list=ls()) – Remove everything from the workspace.
  • paste(..., sep = " ", collapse = NULL), paste0() – Pasting character strings together.
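
A short sketch of writing, saving, and restoring objects, plus the paste() variants. The object names, values, and file names are invented for illustration. rm(list=ls()) is left out on purpose because it would clear the whole workspace; only the two example objects are removed.

```r
# Example objects (made-up values)
scores <- data.frame(name = c("a", "b"), score = c(90, 85))
note <- "week 2"

# write.table() writes a data frame as text; read.table() reads it back
txtFile <- file.path(tempdir(), "scores.txt")
write.table(scores, txtFile, sep = "\t", row.names = FALSE)
back <- read.table(txtFile, sep = "\t", header = TRUE)

# save() stores the named objects in an .rda file
rdaFile <- file.path(tempdir(), "week2.rda")
save(scores, note, file = rdaFile)

# Remove the two objects, then restore them with load()
rm(scores, note)
restored <- load(rdaFile)   # load() returns the names of the objects it restored
ls()                        # names of the objects in the current environment

# paste() joins with a separator; paste0() joins with none
paste("data", "week2", sep = "_")
paste0("file", 1:3, ".csv")
paste(c("a", "b", "c"), collapse = "+")
```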

Packages:
httr – for working with http connections.
RMySQL – for interfacing with MySQL.
bigmemory – for handling data larger than RAM.
RHadoop – for interfacing R and Hadoop (by Revolution Analytics).
foreign – for getting data into R from SAS, SPSS, Octave, etc.

四、Summarizing Data
  • dim(), names(), nrow(), ncol()
  • sapply(eData[1,], class) - apply the function class to each element of the first row of eData, giving the class of every column.
  • unique() – return the input with duplicate elements/rows removed.
  • length() – get or set the length.
  • table() – build a contingency table of counts for each combination of factor levels; with useNA="ifany", NA values are counted as well.
  • eData[eData$Lat > 0 & eData$Lon > 0, c("Lat","Lon")] – return the "Lat" and "Lon" columns of the rows where both Lat > 0 and Lon > 0.
  • eData[eData$Lat > 0 | eData$Lon > 0, c("Lat","Lon")] – return the "Lat" and "Lon" columns of the rows where Lat > 0 or Lon > 0 (or both).
  • is.na() – test each element for a missing value (TRUE where the value is NA).
  • rowSums(), rowMeans(), colSums(), colMeans() – summarize rows/columns; with na.rm=TRUE, NA values are removed first.
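
The summary functions in this list can be tried on a small data frame. The values below are made up; only the names eData, Lat, and Lon come from the list, and the Src column is a hypothetical addition. Wrapping the row conditions in which() is also an addition here: it keeps rows with a missing condition from showing up as all-NA rows.

```r
# Made-up data; eData, Lat, and Lon follow the names used in the list above
eData <- data.frame(
  Lat = c(34.1, -12.5, 60.2, -33.9, NA),
  Lon = c(-118.2, 45.0, 25.0, -70.6, 10.0),
  Src = c("ak", "us", "us", "ci", NA),
  stringsAsFactors = FALSE
)

dim(eData)                         # number of rows and columns
names(eData)
sapply(eData[1, ], class)          # class of each column, taken from the first row

unique(eData$Src)                  # distinct values, NA included
length(unique(eData$Src))
table(eData$Src)                   # NA is dropped by default
table(eData$Src, useNA = "ifany")  # NA gets its own count

# Rows where both conditions hold, keeping only Lat and Lon.
# which() drops rows where the condition evaluates to NA.
both <- eData[which(eData$Lat > 0 & eData$Lon > 0), c("Lat", "Lon")]
# Rows where at least one condition holds
either <- eData[which(eData$Lat > 0 | eData$Lon > 0), c("Lat", "Lon")]

sum(is.na(eData$Lat))              # number of missing latitudes
colMeans(eData[, c("Lat", "Lon")], na.rm = TRUE)
```
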
五、Data Munging Basics
  • tolower(), toupper() - Translate characters in character vectors from upper to lower case or vice versa.
  • strsplit(x, split) - Split the elements of a character vector x into substrings according to the matches to substring split within them.
  • sub(pattern, replacement, x), gsub(pattern, replacement, x) - Replace the first match (sub) or all matches (gsub) of pattern in x.
  • cut() - Divides the range of x into intervals and codes the values in x according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.
  • merge() - Merge two data frames by common columns or row names, or do other versions of database join operations.
  • sort(x, decreasing = FALSE, ...) - Sort (or order) a vector or factor (partially) into ascending or descending order. For ordering along more than one variable, e.g., for sorting data frames, see order().
  • order(..., na.last = TRUE, decreasing = FALSE) - Returns a permutation which rearranges its first argument into ascending or descending order.
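
A brief sketch that exercises the munging functions above. All column names, values, and data frames are invented for illustration.

```r
# Made-up column names to clean
cols <- c("Location.1", "Street_Name_Long", "Speed")

tolower(cols)
strsplit(cols, "\\.")                 # split is a regular expression, so escape the dot
sub("_", "", cols)                    # removes the first underscore only
gsub("_", "", cols)                   # removes every underscore

# cut(): code numeric values by the interval they fall in
speed <- c(12, 28, 35, 51, 64)
speedGroup <- cut(speed, breaks = c(0, 30, 60, 90))
table(speedGroup)

# merge(): join two data frames on a shared key; all = TRUE keeps unmatched rows
reviews   <- data.frame(solution_id = c(1, 2, 4), score = c(8, 6, 9))
solutions <- data.frame(id = c(1, 2, 3), answer = c("x", "y", "z"))
merged <- merge(reviews, solutions, by.x = "solution_id", by.y = "id", all = TRUE)

# sort() returns the sorted values; order() returns the positions that sort them
sort(speed, decreasing = TRUE)
reviews[order(reviews$score), ]       # data frame rows ordered by score
```

Because order() returns row positions and not values, it is the one to use when a whole data frame has to be rearranged by one of its columns.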
