Thursday, June 4, 2015

Coursera R Programming Week 1: Study Notes

R Programming
Johns Hopkins Bloomberg School of Public Health

Week 1
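Course introduction and Week 1 study notes

June 1, 2015 - June 29, 2015

Learn how to program in R and how to use R for effective data analysis. This is the second course in the Johns Hopkins Data Science Specialization.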

Course categories
Information, Tech & Design
Statistics and Data Analysis

Instructors

Roger D. Peng, PhD - Johns Hopkins University
Jeff Leek, PhD - Johns Hopkins University
Brian Caffo, PhD - Johns Hopkins University

Course at a glance
Length: 4 weeks, 7-9 hours per week
Language: English
Subtitles: English, Español & Chinese

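Course overview
In this course you will learn how to program in R and how to use R for effective data analysis. You will learn how to install and configure the software needed for a statistical programming environment, and how generic programming language concepts are implemented in a high-level statistical language. The course covers practical issues in statistical computing, including programming in R, reading data into R, accessing R packages, writing R functions, debugging, profiling R code, and organizing and commenting R code. Topics in statistical data analysis provide working examples.

Syllabus
  • Week 1: Overview of R, R data types and objects, reading and writing data
  • Week 2: Control structures, functions, scoping rules, dates and times
  • Week 3: Loop functions, debugging tools
  • Week 4: Simulation, code profiling

Prerequisites
Some familiarity with programming concepts and the basics of statistical reasoning will be helpful; the related course is The Data Scientist's Toolbox.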

Suggested reading
The e-book R Programming for Data Science covers all of the material presented in this course. It is available for download from Leanpub.

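Course format
Each week of the course includes videos, quizzes, and programming assignments.

As part of this course you will need to set up a GitHub account. GitHub is a tool for sharing and collaboratively editing code. In this course and the other courses of the specialization, you will submit links to files hosted publicly under your GitHub account as part of the peer-assessed assignments. If you are concerned about your identity being revealed, register an anonymous GitHub account, and be careful not to include any information that you do not want your peer evaluators to see.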

Data Science Specialization Community Site

Since the beginning of the Data Science Specialization, we've noticed the unbelievable passion students have about our courses and the generosity they show toward each other on the course forums. A couple students have created quality content around the subjects we discuss, and many of these materials are so good we feel that they should be shared with all of our students.

We're excited to announce that we've created a site using GitHub Pages: http://datasciencespecialization.github.io/ to serve as a directory for content that the community has created. If you've created materials relating to any of the courses in the Data Science Specialization, please send us a pull request so we can add a link to your content on our site. You can find out more about contributing here: https://github.com/DataScienceSpecialization/DataScienceSpecialization.github.io#contributing

We can't wait to see what you've created and where the community can take this site!
- The JHU Data Science Lab Team
Thu 4 Jun 2015 4:30 AM CST

R Programming: Week 1

As you browse the course web site, please make sure to read through the syllabus which contains important information about the grading policy for quizzes and programming assignments as well as the course schedule. 

Please pay particular attention to the differences among the various Programming Assignments. Whereas Programming Assignments 1 and 3 are graded via unit tests that use a submission script that will compare the output of your functions to the correct output, Programming Assignment 2 requires that you submit R code for evaluation and grading by your classmates.  

This week will cover the basics to get you started up with R. There are videos demonstrating how to install R on Windows and Mac. The Week 1 videos will cover the history of R and S, go over the basic data types in R, and describe the functions for reading and writing data. I recommend that you watch the videos in the order that they are listed on the web page, but watching the videos out of order isn't going to ruin the story. For each lecture video you can download a separate PDF document of the slides (the demonstration videos don't have slides associated with them).
Roger Peng and the Data Science Team
Mon 1 Jun 2015 9:01 PM CST

R Programming: Welcome to swirl

In this course, you have the option to use the swirl R package to practice some of the concepts we cover in lectures. swirl teaches you R programming and data science interactively, at your own pace, and right in the R console.

Each lesson that you complete in swirl is worth one extra credit point. However, the maximum number of points you may earn for the assignment is capped at 5. While these lessons will give you valuable practice and you are encouraged to complete as many as possible, please note that they are completely optional and you can get full marks in the class without completing them.

You can find the instructions for how to install and use swirl in the Programming Assignments section of the course under Week 1. Have fun!
Roger Peng and the Data Science Team
Mon 1 Jun 2015 9:01 PM CST

R Programming: Pre-Course Survey

Thanks for signing up for R Programming. As you probably know, this course is part of the Data Science Specialization, a sequence of nine massive open online courses (MOOCs) plus a Capstone project. We would like to learn more about your motives for taking this course and your intentions for both this course and the overall specialization. To help us out, please complete our short pre-course survey. It should only take about 3 minutes of your time.

Thanks,
The Data Science Team
Mon 1 Jun 2015 9:01 PM CST

Assessments
Quizzes

There will be one quiz every week. The quizzes will all open on the first day of the course but they will be due weekly. So the Week 1 Quiz will be due at the end of the first week and the Week 2 Quiz will be due at the end of the second week, etc.

Please refer to the individual weekly Quiz deadlines to see the exact date and time that each Quiz is due.

Programming Assignments

There will be three required programming assignments. The first programming assignment is due at the end of the second week. Subsequent programming assignments are due weekly after that.

Programming Assignments 1 and 3 will be graded via unit tests using a submission script that will compare the output of your functions to the correct output. To access Programming Assignments 1 and 3, click the corresponding link in the left navigation bar.

Programming Assignment 2 will be submitted differently and graded via a peer assessment. To access Programming Assignment 2, click the corresponding link in the left navigation bar.

swirl Programming Assignment (optional)

In this course, you have the option to use the swirl R package to practice some of the concepts we cover in lectures.

Each lesson that you complete in swirl is worth one extra credit point. However, the maximum number of points you may earn for the assignment is capped at 5. While these lessons will give you valuable practice and you are encouraged to complete as many as possible, please note that they are completely optional and you can get full marks in the class without completing them.

You can find the instructions for how to install and use swirl in the Programming Assignments section of the course under Week 1.

Grading
Quizzes

You may attempt each quiz up to 3 times. The score of your most successful attempt will count toward your grade.

Programming Assignments

Programming assignments 1 and 3 will require submissions via a submission script. You may make an unlimited number of submissions for programming assignments 1 and 3, and your most successful submission will count toward your grade. The swirl Programming Assignment is completely optional and extra credit.

Hard deadlines and soft deadlines for Quizzes 1-3 and Programming Assignment 1

The reported due date is the soft deadline for quizzes 1-3 and programming assignment 1. You may turn in quizzes 1-3 and programming assignment 1 up to five days after the soft deadline. The hard deadline is five days after the Quiz is due at 23:30 UTC. If you submit after the due date (but before the hard deadline), your submission score will be penalized by 10% for each day after the due date. If you use a late day, the 10% per day penalty will not be applied for that day.

Please note: There is no partial credit grace period for Quiz 4 or Programming Assignments 2 and 3. Those deadlines are firm, and work submitted after the hard deadline will not receive credit.

Late Days for Quizzes and Programming Assignment 1

You are permitted a total of 5 late days for quizzes and assignments in the course. If you use a late day, your quiz or assignment grade will not be affected if it is submitted late.

No Late Days for Programming Assignment 2

Peer assessments deadlines have to be synchronous. Therefore, Late Days cannot be applied to Programming Assignment 2. Only one deadline can be set for students to submit and peer-grade each other's work. This is necessary in order to maintain a synchronized peer grading process.

Points and scoring

There are 100 points available in the course. The breakdown of points is as follows:
  • Week 1 Quiz - 20 points
  • Week 2 Quiz - 10 points
  • Week 3 Quiz - 5 points
  • Week 4 Quiz - 10 points
  • Programming Assignment 1 (Air Pollution) - 20 points
  • Programming Assignment 2 (Lexical Scoping) - 10 points
  • Programming Assignment 3 (Hospital Quality) - 25 points
  • swirl Programming Assignment - Maximum of 5 extra credit points
You must earn 70 points to pass the course and earn a certificate. Students who earn 90 points and above will receive a certificate with Distinction.

Week 1: Getting Started and R Nuts and Bolts

This week is all about getting started with R and learning some of the basic details of the language. If you haven't already installed R, you should go to the R web site and download R for your platform (Windows, Mac, or Unix/Linux). Also, if you want, you can download RStudio, which is a free interactive development environment designed for R that is very useful and we use quite a bit in the Data Science Specialization. I've made some videos to help you along with the installation process:
  • Installing R on Windows
  • Installing R on a Mac
  • Installing RStudio (on a Mac)
Before you start using R, one key concept is the working directory. This is the directory/folder on your computer where you will store project files, data, and code. It's important that you tell R where the working directory is that you will be using so that it knows where to find the appropriate file (the working directory can be any directory on your computer). These videos tell you how to set your working directory:
  • Setting your working directory (Windows)
  • Setting your working directory (Mac)

Learning Objectives

By the end of week 1 you should be able to:
  • Install the R and RStudio software packages
  • Download and install the swirl package for R
  • Describe the history of the S and R programming languages
  • Describe the differences between atomic data types
  • Execute basic arithmetic operations
  • Subset R objects using the "[", "[[", and "$" operators and logical vectors
  • Describe the explicit coercion feature of R
  • Remove missing (NA) values from a vector

Programming

There is no official graded programming assignment for this week. However, we have developed a series of exercises to get you started with R. These exercises are implemented using the swirl package for R. You can read the instructions for how to install and use swirl to practice some of the concepts in R.
  • Introduction to swirl
  • swirl programming assignment instructions
  • swirl homepage
The swirl programming assignment is NOT required. However, if you complete the swirl programming assignment you can receive up to 5 extra credit points toward your final class score.

Video Lectures
Background Material
Writing Code / Setting Your Working Directory (Windows)

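How to set the working directory and how to edit R code files

The working directory is where R reads and writes files on your computer. Use getwd() to find the current working directory.
Why does it matter? When you read or write data with functions such as read.csv() or write.csv(), relative file names are resolved against the working directory.

To change the working directory (R GUI on Windows): File → Change dir...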

> getwd()
[1] "D:/[Coursera]R Programming"

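To write R code: File → New script

# Returns the mean of 100 standard normal random draws
myfunction <- function() {
    x <- rnorm(100)
    mean(x)
}

You can define the function by typing it in full at the R console.
Alternatively, save it to a file first; running dir() in the console will then list the saved file, and source() loads it into R. After every change to the function, remember to save the file and source it again.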
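
The save, dir(), source() cycle described above can be sketched as follows. This is only an illustration: tempdir() stands in for a real project folder, and the file name myfunction.R is a hypothetical choice, not something prescribed by the course.

```r
# Sketch: save a function to a script file, list it, and load it with source().
# tempdir() and "myfunction.R" are assumptions made for this example.
old_wd <- getwd()                 # remember the current working directory
setwd(tempdir())                  # use a scratch directory as the working directory

# Write the function definition to a script file in the working directory
writeLines(c(
  "myfunction <- function() {",
  "    x <- rnorm(100)",
  "    mean(x)",
  "}"
), "myfunction.R")

file_listed <- "myfunction.R" %in% dir()   # dir() lists the working directory
source("myfunction.R")                     # define myfunction in the session
result <- myfunction()                     # mean of 100 random normal draws

setwd(old_wd)                     # restore the original working directory
```

Because the file name is relative, both writeLines() and source() resolve it against the working directory, which is why setting that directory first matters.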

Week 1
Overview and History of R

What is S?
Statistical Models in S
The book Programming with Data

Some R Resources
  • An Introduction to R
  • Writing R Extensions
  • R Data Import/Export
  • R Installation and Administration (mostly for building R from sources)
  • R Internals (not for the faint of heart)

Some Useful Books on S/R
Standard texts
  • Chambers (2008), Software for Data Analysis, Springer. (your textbook)
  • Chambers (1998), Programming with Data, Springer
  • Venables & Ripley (2002), Modern Applied Statistics with S, Springer
  • Venables & Ripley (2000), S Programming, Springer
  • Pinheiro & Bates (2000), Mixed-Effects Models in S and S-PLUS, Springer
  • Murrell (2005), R Graphics, Chapman & Hall/CRC Press

Other resources

Getting Help

R Console Input and Evaluation

Data Types - R Objects and Attributes
Objects
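  • character
  • numeric (real numbers, including decimals)
  • integer
  • complex
  • logical (TRUE/FALSE)
The most basic object in R is the vector.

A standard vector cannot hold objects of different classes: every element of a vector must be of the same class.

The one exception is the list, which is represented as a vector but can hold objects of different classes. It is a sequence of objects in which each element may have its own class.

Empty vectors can be created with the vector() function, which takes two basic arguments: the class of the objects the vector will hold, and the length of the vector.

Numbers in R are generally treated as numeric objects, and almost all of them are stored as double-precision real numbers. If you explicitly want an integer, append an uppercase L to the number: 1 is a numeric object, while 1L is an integer.

Inf represents infinity and can be used in ordinary calculations, e.g. 1 / 0 is Inf and 1 / Inf is 0.
There is also -Inf.

The special value NaN ("not a number") represents an undefined value, e.g. 0 / 0 is NaN. NaN can also be thought of as a missing value.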
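
The points about numeric versus integer objects and the special values Inf and NaN can be checked directly at the console. The snippet below is a small sketch; the variable names are made up for the example.

```r
# Sketch: numeric vs. integer objects and the special values Inf and NaN.
x_dbl <- 1        # a plain number is stored as a double
x_int <- 1L       # the L suffix requests an integer

class(x_dbl)      # "numeric"
class(x_int)      # "integer"

pos_inf <- 1 / 0      # Inf
neg_inf <- -1 / 0     # -Inf
zero    <- 1 / Inf    # 0
undef   <- 0 / 0      # NaN

is.nan(undef)     # TRUE
is.na(undef)      # TRUE: NaN also counts as a missing value
```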

Attributes
  • names, dimnames
  • dimensions (e.g. matrices, arrays)
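  • class: every object belongs to a class. For example, the class of a numeric object is "numeric" and the class of an integer object is "integer".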
  • length
  • other user-defined attributes/metadata

Data Types - Vectors and Lists

The c() function (short for "concatenate") joins objects together into a vector.

> x <- c(0.5,0.6)     ## numeric
> x <- c(T,F)         ## logical
> x <- c(TRUE, FALSE) ## logical
> x <- c("a","b","c") ## character
> x <- 9:29           ## integer
> x <- c(1+0i, 2+4i)  ## complex

> x <- vector("numeric", length = 10)
> x
 [1] 0 0 0 0 0 0 0 0 0 0

What happens if you create a vector from objects of two different classes?
R does not raise an error. Instead it coerces every element to the "least common denominator" class, the one class that can represent all of the elements.

> y <- c(1.7, "a")    ## character -> c("1.7", "a")
> y <- c(TRUE, 2)     ## numeric -> c(1, 2)
> y <- c("a", FALSE)  ## character -> c("a", "FALSE")

Explicit Coercion

Objects can be explicitly coerced from one class to another with the as.* functions.

> x <- 0:6
> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"

Coercion does not always succeed: a conversion that makes no sense produces NA.

> x <- c("a", "b", "c")
> as.numeric(x)
[1] NA NA NA
Warning message:
NAs introduced by coercion
> as.logical(x)
[1] NA NA NA

Lists

> x <- list(1, "a", TRUE, 1+4i)
> x
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i

Data Types - Matrices

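Matrices are not a separate kind of object: they are vectors with a dimension (dim) attribute. The dim attribute is itself an integer vector of length 2, giving the number of rows followed by the number of columns.

> m <- matrix(nrow = 2, ncol = 3)
> m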
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3

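Matrices are constructed column-wise: the values of the vector fill the first column from top to bottom, then the second column once the first has reached the number of rows, and so on.

> m <- matrix(1:6, nrow = 2, ncol = 3)
> m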
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> m <- 1:10
> m
 [1]  1  2  3  4  5  6  7  8  9 10
> dim(m) <- c(2, 5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> x <- 1:3
> y <- 10:12
> cbind(x,y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x,y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12

Data Types - Factors

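Factors are used to represent categorical data. They come in two kinds, unordered and ordered: unordered factors label data that fall into categories with no natural order, while ordered factors label categories that have a ranking. Ordered data need not be numeric. You can think of a factor as an integer vector in which each integer has a label: for example, a vector of 1s, 2s, and 3s in which 1 stands for a high value and 2 and 3 stand for medium and low values.

Factors matter because modeling functions such as lm() and glm(), which fit linear and generalized linear models, treat them specially.

In general, a factor with labels is better than a plain integer vector because factors are self-describing: a variable with the values "Male" and "Female" is easier to interpret than one containing only 1s and 2s.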

> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x
[1] yes yes no  yes no 
Levels: no yes  ## levels are sorted alphabetically: "no" is the baseline level, "yes" the second
> table(x)      ## frequency of each level
x
 no yes 
  2   3 
> unclass(x)    ## strip the class to show the underlying integer codes
[1] 2 2 1 2 1
attr(,"levels")
[1] "no"  "yes" ## no: 1; yes: 2
> x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))  ## set the order of the levels explicitly
> x
[1] yes yes no  yes no 
Levels: yes no
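
The transcript above only shows an unordered factor. As a rough sketch of the ordered kind mentioned in the text, the example below uses made-up data and the ordered = TRUE argument of factor(); the level names are assumptions chosen for illustration.

```r
# Sketch: an ordered factor with hypothetical "low" / "medium" / "high" data.
sizes <- c("low", "high", "medium", "low")
f <- factor(sizes, levels = c("low", "medium", "high"), ordered = TRUE)

f                          # prints the values and Levels: low < medium < high
codes <- as.integer(f)     # 1 3 2 1: the integer codes behind the labels
below_high <- f < "high"   # TRUE FALSE TRUE TRUE: comparisons respect the order
counts <- table(f)         # frequencies, reported in level order
```

With an unordered factor the comparison f < "high" would not be meaningful; declaring the order is what makes ranking operations available.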

Data Types - Missing Values

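Missing values are denoted by NA or NaN: NaN marks the result of an undefined mathematical operation, and NA marks any other missing value.
  • is.na() tests whether objects are NA
  • is.nan() tests whether objects are NaN
  • NA values have a class too: there are integer NA, character NA, numeric NA, and so on
  • A NaN value is also NA, but an NA value is not necessarily NaN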

> x <- c(1, 2, NA, 10, 3)
> is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
> x <- c(1, 2, NaN, NA, 4)
> is.na(x)
[1] FALSE FALSE  TRUE  TRUE FALSE
> is.nan(x)
[1] FALSE FALSE  TRUE FALSE FALSE
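
The statement that NA values carry a class can be illustrated with R's typed NA constants, and the same is.na() test can be used to drop missing values. This is a minimal sketch with made-up variable names.

```r
# Sketch: typed NA constants, and removing missing values with is.na().
class(NA)             # "logical"
class(NA_integer_)    # "integer"
class(NA_real_)       # "numeric"
class(NA_character_)  # "character"

x <- c(1, 2, NaN, NA, 4)
n_missing <- sum(is.na(x))    # 2: is.na() is TRUE for both NA and NaN
n_nan     <- sum(is.nan(x))   # 1: is.nan() is TRUE only for NaN
clean     <- x[!is.na(x)]     # 1 2 4
```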

Data Types - Data Frames

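Data frames are the key data type for storing tabular data. A data frame is a special kind of list in which every element has the same length. Each column of the data frame is one element of the list, so all columns must have the same length (that is what makes it a table), but they do not all have to be of the same type.

Every element of a matrix must be of the same class, whereas a data frame can store a different class of object in each column.

Data frames also have some special features:
  • row.names: an attribute that gives every row of the data frame a name
  • Data frames are usually created with read.table() or read.csv()
  • data.matrix() converts a data frame to a matrix (coercing all columns to a common type)

A data frame can also be created directly with data.frame():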

> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
> nrow(x)
[1] 4
> ncol(x)
[1] 2
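
To make the row.names and data.matrix() points concrete, here is a short sketch that reuses the data frame from the transcript above. It only illustrates the coercion; it is not part of the original notes.

```r
# Sketch: column classes, row names, and conversion to a matrix.
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

col_classes <- sapply(x, class)   # foo is "integer", bar is "logical"
rn <- row.names(x)                # "1" "2" "3" "4": default row names

m <- data.matrix(x)               # logical column is coerced to 1/0
m
```

After the conversion every cell of m holds a number, because a matrix, unlike a data frame, can only store one type.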

Data Types - Names Attribute

Names are very useful for writing readable code and for creating self-describing objects.

> x <- 1:3
> names(x)
NULL
> names(x) <- c("foo", "bar", "norf")
> x
 foo  bar norf 
   1    2    3 
> names(x)
[1] "foo"  "bar"  "norf"

> x <- list(a = 1, b = 2, c = 3)
> x
$a
[1] 1

$b
[1] 2

$c
[1] 3

> m <- matrix(1:4, nrow = 2, ncol = 2)
> dimnames(m) <- list(c("a", "b"), c("c", "d"))
> m
  c d
a 1 3
b 2 4

Data Types - Summary

Data Types
  • atomic classes: numeric, logical, character, integer, complex
  • vectors, lists
  • factors
  • missing values
  • data frames
  • names

Reading Tabular Data
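  • read.table, read.csv, for reading tabular data stored in rows and columns
  • readLines, for reading the lines of a text file
  • source, for reading in R code files (inverse of dump)
  • dget, for reading in R objects that have been deparsed into text files (inverse of dput)
  • load, for reading in saved workspaces
  • unserialize, for reading single R objects in binary form

The analogous functions for writing data are: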
  • write.table
  • writeLines
  • dump
  • dput
  • save
  • serialize
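
Each reading function pairs with a writing function. The sketch below runs three of those round trips on throwaway files; the object and the temporary file paths are assumptions made for the example.

```r
# Sketch: three write/read round trips using temporary files.
obj <- list(id = 1:3, label = "demo")     # hypothetical object

# serialize / unserialize: R object <-> raw vector of bytes
bytes <- serialize(obj, connection = NULL)
obj2  <- unserialize(bytes)

# save / load: named objects <-> binary file
rdata_path <- tempfile(fileext = ".RData")
save(obj, file = rdata_path)
rm(obj)
loaded_names <- load(rdata_path)          # restores obj, returns the names loaded

# writeLines / readLines: character vector <-> text file
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)
lines_back <- readLines(txt_path)
```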

read.table
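  • file, the name of a file or a connection
  • header, a logical flag indicating whether the first line is a header line (variable names)
  • sep, a string indicating how the columns are separated (comma, colon, tab, or space)
  • colClasses, a character vector with one entry per column, giving the class of each column in the dataset; it is optional, but it tells read.table() the type of each column up front
  • nrows, the number of rows in the dataset
  • comment.char, a character string indicating the comment character; the default is #
  • skip, the number of lines to skip from the beginning of the file
  • stringsAsFactors, whether character variables are coded as factors; the default was TRUE when these notes were written and has been FALSE since R 4.0.0

read.csv() is equivalent to read.table() apart from its defaults: read.csv() uses a comma as the separator and sets header = TRUE, while read.table() separates on whitespace and sets header = FALSE. read.csv() is convenient for CSV (comma-separated values) files, a format commonly exported from spreadsheet programs such as Excel.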
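
The relationship between the two functions can be seen on a tiny file. This is a sketch only: the file is written to a temporary path, and its contents are invented for the example.

```r
# Sketch: read.csv() versus read.table() on a small comma-separated file.
path <- tempfile(fileext = ".csv")        # hypothetical file
writeLines(c("name,score", "ann,90", "bob,85"), path)

a <- read.csv(path)                               # header = TRUE, sep = "," by default
b <- read.table(path, header = TRUE, sep = ",")   # same settings, spelled out
same <- identical(a, b)                           # TRUE for this simple file

# With its own defaults, read.table() splits on whitespace and expects no
# header, so each line ends up as a single value in one column.
d <- read.table(path)
dim(d)                                            # 3 1
```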

Reading Large Tables

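There are several ways to make read.table() more efficient:
Estimate how much memory the data will need, so you know whether your computer has enough RAM for it; by default R loads the entire dataset into memory.
If the file contains no comment lines, set comment.char = "".
Set colClasses; otherwise R has to scan every column to work out its type.

initial <- read.table("datatable.txt", nrows = 100)          # read only the first 100 rows
classes <- sapply(initial, class)                            # record the class R inferred for each column
tabAll <- read.table("datatable.txt", colClasses = classes)  # reuse those classes for the full read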

Calculating Memory Requirements

Suppose a data frame has 1,500,000 rows and 120 columns, and every column holds numeric data.

1,500,000 × 120 × 8 bytes/numeric
= 1,440,000,000 bytes
= 1,440,000,000 / 2^20 bytes/MB
= 1,373.29 MB
= 1.34 GB

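That is the memory needed for the dataset itself. Reading it in requires some overhead on top of that; as a rule of thumb, read.table() needs roughly twice as much memory as the dataset.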
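
The arithmetic above can be reproduced in R, and object.size() offers a quick sanity check of the 8 bytes per numeric figure. The variable names are made up for this sketch.

```r
# Sketch: estimate the memory needed for a 1,500,000 x 120 numeric data frame.
n_rows <- 1500000
n_cols <- 120

bytes <- n_rows * n_cols * 8   # 8 bytes per numeric (double) value
mb <- bytes / 2^20             # bytes -> MB
gb <- mb / 2^10                # MB -> GB

round(mb, 2)                   # 1373.29
round(gb, 2)                   # 1.34

# Sanity check: a vector of one million doubles takes slightly more than
# 8 bytes per element, the small extra being the vector's header.
per_element <- as.numeric(object.size(numeric(1e6))) / 1e6
```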

Textual Data Formats

dput-ting R Objects

> y <- data.frame(a = 1, b = "a")
> dput(y)
structure(list(a = 1,
               b = structure(1L, .Label = "a", class = "factor")),
          .Names = c("a", "b"),
          row.names = c(NA, -1L),
          class = "data.frame")
> dput(y, file = "y.R")
> new.y <- dget("y.R")
> new.y
  a b
1 1 a

Dumping

> x <- "foo"
> y <- data.frame(a = 1, b = "a")
> dump(c("x", "y"), file = "data.R")
> rm(x, y)  ## clear x, y
> source("data.R")
> y
  a b
1 1 a
> x
[1] "foo"

Connections: Interfaces to the Outside World
  • file
  • gzfile
  • bzfile
  • url

> str(file)
function (description = "", open = "", blocking = TRUE, 
          encoding = getOption("encoding"), raw = FALSE)
  • "r" read only
  • "w" writing (and initializing a new file)
  • "a" appending
  • "rb", "wb", "ab" reading, writing, or appending in binary mode (Windows)

con <- file("foo.txt", "r")
data <- read.csv(con)
close(con)

is equivalent to

data <- read.csv("foo.txt")

> con <- gzfile("words.gz")
> x <- readLines(con, 10)
> x
 [1] "1080"     "10-point" "10th"     "11-point"
 [5] "12-point" "16-point" "18-point" "1st"
 [9] "2"        "20-point"

## This might take time
> con <- url("http://www.jhsph.edu", "r")
> x <- readLines(con)
> head(x)
[1] "<!DOCTYPE html>"                                               
[2] "<html lang=\"en\">"                                            
[3] ""                                                              
[4] "<head>"                                                        
[5] "<meta charset=\"utf-8\" />"                                    
[6] "<title>Johns Hopkins Bloomberg School of Public Health</title>"
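
The connection examples above depend on files and a web page that may not be available. As a self-contained sketch of the same open, read or write, close pattern, the snippet below creates its own compressed file at a temporary path; the file contents are invented.

```r
# Sketch: write to and read from a gzip-compressed file through connections.
path <- tempfile(fileext = ".gz")        # hypothetical compressed file

con <- gzfile(path, "w")                 # open the connection for writing
writeLines(c("alpha", "beta", "gamma"), con)
close(con)                               # always close what you open

con <- gzfile(path, "r")                 # reopen it for reading
first_two <- readLines(con, n = 2)       # read only the first two lines
close(con)

first_two                                # "alpha" "beta"
```

Reading through a connection is what makes partial reads like readLines(con, n = 2) possible without loading the whole file.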

Subsetting - Basics

> x <- c("a", "b", "c", "c", "d", "a")
> x[1]
[1] "a"
> x[2]
[1] "b"
> x[1:4]
[1] "a" "b" "c" "c"
> x[x > "a"]
[1] "b" "c" "c" "d"
> u <- x > "a"
> u
[1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE
> x[u]
[1] "b" "c" "c" "d"

Subsetting - Lists

> x <- list(foo = 1:4, bar = 0.6)
> x[1]
$foo
[1] 1 2 3 4
> x[[1]]
[1] 1 2 3 4
> x$bar
[1] 0.6
> x[["bar"]]
[1] 0.6
> x["bar"]
$bar
[1] 0.6

> x <- list(foo = 1:4, bar = 0.6, baz = "hello")
> x[c(1, 3)]
$foo
[1] 1 2 3 4

$baz
[1] "hello"

The advantage of [[ ]] over $ is that double brackets can take a computed index, whereas $ only accepts a literal name.

> name <- "foo"
> x[[name]] ## computed index for 'foo'
[1] 1 2 3 4
> x$name    ## element 'name' doesn't exist!
NULL
> x$foo     ## element 'foo' does exist
[1] 1 2 3 4

> x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
> x[[c(1, 3)]]
[1] 14
> x[[1]][[3]]
[1] 14
> x[[c(2, 1)]]
[1] 3.14

Subsetting - Matrices

> x <- matrix(1:6, 2, 3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x[1, 2]
[1] 3
> x[2, 1]
[1] 2
> x[1, ]
[1] 1 3 5
> x[, 2]
[1] 3 4
> x[1, 2, drop = FALSE]  ## 1×1 Matrix
     [,1]
[1,]    3
> x[1, , drop = FALSE]   ## 1×3 Matrix
     [,1] [,2] [,3]
[1,]    1    3    5

Subsetting - Partial Matching

> x <- list(aardvark = 1:5)
> x$a
[1] 1 2 3 4 5
> x[["a"]]
NULL
> x[["a", exact = FALSE]]
[1] 1 2 3 4 5

Subsetting -  Removing Missing Values

> x <- c(1, 2, NA, 4, NA, 5)
> bad <- is.na(x)
> x[!bad]
[1] 1 2 4 5
> x <- c(1, 2, NA, 4, NA, 5)
> y <- c("a", "b", NA, "d", NA, "f")
> good <- complete.cases(x, y)
> good
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
> x[good]
[1] 1 2 4 5
> y[good]
[1] "a" "b" "d" "f"
> airquality[1:6, ]  ## a dataset built into R
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> good <- complete.cases(airquality)
> airquality[good, ][1:6, ]
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
7    23     299  8.6   65     5   7
8    19      99 13.8   59     5   8

Vectorized Operations

Vectorized operations act on whole vectors or matrices element by element, with no explicit loop.

> x <- 1:4; y <- 6:9
> x + y
[1]  7  9 11 13
> x > 2
[1] FALSE FALSE  TRUE  TRUE
> x >= 2
[1] FALSE  TRUE  TRUE  TRUE
> y == 8
[1] FALSE FALSE  TRUE FALSE
> x * y
[1]  6 14 24 36
> x / y
[1] 0.1666667 0.2857143 0.3750000 0.4444444
> x <- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2)
> y
     [,1] [,2]
[1,]   10   10
[2,]   10   10
> x * y       ## not true matrix multiplication
     [,1] [,2]
[1,]   10   30
[2,]   20   40
> x / y
     [,1] [,2]
[1,]  0.1  0.3
[2,]  0.2  0.4
> x %*% y     ## true matrix multiplication
     [,1] [,2]
[1,]   40   40
[2,]   60   60
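
To show what vectorization saves, the sketch below computes the same element-wise sum twice, once with an explicit loop and once with the vectorized operator. It is an illustration only; the variable names are made up.

```r
# Sketch: an explicit loop versus the equivalent vectorized expression.
x <- 1:4
y <- 6:9

# Loop version: add the elements one position at a time
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
    z_loop[i] <- x[i] + y[i]
}

# Vectorized version: one expression, same result
z_vec <- x + y

all(z_loop == z_vec)    # TRUE
```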