Saturday, June 27, 2015

Into Thin Air

Into Thin Air (Chinese edition: 聖母峰之死)
By Jon Krakauer
Translated by 宋碧雲 and 林曉欽
Published by 大家出版社
“Men play at tragedy because they do not believe in the reality of the tragedy which is actually being staged in the civilised world.”
──José Ortega y Gasset 
 It would seem almost as though there were a cordon drawn round the upper part of these great peaks beyond which no man may go. The truth of course lies in the fact that, at altitudes of 25,000 feet and beyond, the effects of low atmospheric pressure on the human body are so severe that really difficult mountaineering is impossible and the consequences even of a mild storm may be deadly, that nothing but the most perfect combinations of weather and snow offers the slightest chance of success, and that on the last lap of the climb no party is in position to choose its day...
 No, it is not remarkable that Everest did not yield to the first few attempts; indeed it would have been very surprising and not a little sad if it had, for that is not the way of great mountains. Perhaps we had become a little arrogant with our fine new technique of ice-claw and rubber slipper, our age of easy mechanical conquest. We had forgotten that the mountain still holds the master card, that it will grant success only in its own good time. Why else does mountaineering retain its deep fascination?
Eric Shipton, in 1938, Upon That Mountain

Thursday, June 25, 2015

Coursera R Programming Week 4: Study Notes

R Programming
Johns Hopkins Bloomberg School of Public Health

Week 4 study notes

June 1, 2015 - June 29, 2015

Week 4: Simulation and Profiling

This week covers how to simulate data in R, which serves as the basis for doing simulation studies. We also cover the profiler in R which lets you collect detailed information on how your R functions are running and to identify bottlenecks that can be addressed. The profiler is a key tool in helping you optimize your programs. Finally, we cover the str function, which I personally believe is the most useful function in R.

Learning Objectives

By the end of this week you should be able to:
  • Call the str function on an arbitrary R object
  • Describe the difference between the "by.self" and "by.total" output produced by the R profiler
  • Simulate a random normal variable with an arbitrary mean and standard deviation
  • Simulate data from a normal linear model

Programming

There is a graded programming assignment for this week.
  • Programming assignment 3: Hospital Quality

Saturday, June 20, 2015

Coursera R Programming Week 3: Study Notes

R Programming
Johns Hopkins Bloomberg School of Public Health

Week 3 study notes

June 1, 2015 - June 29, 2015

R Programming: Week 3

We have now entered the third week of R Programming which also marks the halfway point. The lectures this week cover loop functions and the debugging tools in R. These aspects of R make R useful for both interactive work and writing longer code, and so they are commonly used in practice.

The Programming Assignment is challenging and so I encourage you to start early if you have the chance. It requires you to explore some of the more interesting aspects of the R language, including taking advantage of the scoping rules to implement state preservation in R objects.

Note that the programming assignment this week is implemented as a Peer Assessment so you will not see it listed with the other Programming Assignments. Please go to the Programming Assignment 2 section of the course to find the assignment instructions. Also, for this assignment, you will need to set up your GitHub account if you have not yet done so.

Best of luck!
Roger Peng and the Data Science Team
Mon 15 Jun 2015 8:01 AM CST

Week 3: Loop Functions and Debugging

This week is what I call "loop functions" in R, which are functions that allow you to execute loop-like behavior in a compact form. These functions typically have the word "apply" in them and are particularly convenient when you need to execute a loop on the command line when using R interactively. These functions are some of the more interesting functions of the R language. This week we also cover the debugger that comes with R and how to interpret its output to help you find problems in your programs and functions.

Learning Objectives

By the end of this week you should be able to:
  • Define an anonymous function and describe its use in loop functions [see lapply]
  • Describe how to start the R debugger for an arbitrary R function
  • Describe what the traceback() function does and what is the function call stack

Programming

There is a graded programming assignment for this week. Please note that this assignment is graded via peer assessment.
  • Programming assignment 2: Lexical Scoping

Wednesday, June 17, 2015

Coursera The Data Scientist’s Toolbox Week 3: Study Notes

The Data Scientist’s Toolbox
Johns Hopkins Bloomberg School of Public Health

Week 3 study notes

June 1, 2015 - June 29, 2015

Video Lectures
Week 3 (34:38)

Types of Data Science Questions (9:09)

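The question types below are ordered roughly by how hard it is to actually achieve the goal of the analysis:
  1. Descriptive
  2. Exploratory
  3. Inferential
  4. Predictive
  5. Causal
  6. Mechanistic

Descriptive analysis

The goal of this type of analysis is simply to describe a set of data. You do not need to make any decisions on the basis of the description; describing the data and interpreting it are two separate steps.
Without additional statistical modeling, these descriptions usually cannot be generalized: you are describing only what you see in this data set, and you cannot say that the next person will see the same thing.
Descriptive analysis is also the type of analysis most commonly applied to census data.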

United States Census 2010

Google Ngram Viewer

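These examples simply describe what happened; you cannot use them to predict!

Exploratory analysis

In this type of analysis you look at the data and try to discover previously unknown relationships, without necessarily confirming them. It is therefore good for finding new connections and for defining future data science projects, whose job is to try to confirm what the exploration turned up.
For any real question, an exploratory analysis usually does not have the final say, and it usually should not be used for generalizing or predicting.
One important point, which you have probably heard before: "Correlation does not imply causation."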

Liu et al. (2012) Scientific Reports

The Sloan Digital Sky Survey

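Inferential analysis

The goal of inferential analysis is to take the information obtained from a relatively small number of observations and generalize or extrapolate it to a larger population. Most of the statistical models you have heard of are used for inference. Inference involves estimating both the quantity you care about and the uncertainty of that estimate, and it depends heavily on the population you observe and the sampling scheme you use.

Correia et al. (2013) Epidemiology (the relationship between air pollution control and life expectancy in the United States)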

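Predictive analysis

Predictive analysis uses data collected on some objects to predict the value for another object that you may observe next.
An important point: even if x predicts y, that does not mean x causes y.
Accurate prediction depends heavily on measuring the right variables. Prediction models can be better or worse, but as a rule of thumb more data and a simpler model tend to predict well.

"Prediction is very difficult, especially about the future."

http://fivethirtyeight.blogs.nytimes.com/ (Nate Silver forecasting U.S. elections on his blog, FiveThirtyEight)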


How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did


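Causal analysis

Causal analysis asks: if you change the value of one variable, what happens to the value of another?
In general, the gold standard for causal analysis is a randomized study or randomized controlled trial. You can also study causation using observational data stored in a database, but the results are harder to defend, and you must make stronger assumptions about how your model works. Causal relationships are usually identified as average effects: if you give a particular drug to a group of people, then on average they live somewhat longer than a group that did not take the drug. For most applications whose aim is to establish a causal relationship between variables, this is usually the gold standard of data analysis.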

van Nood et al. (2013) NEJM

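Mechanistic analysis

Very few analyses have mechanistic analysis as their goal. A mechanistic analysis tries to understand the exact changes in variables that lead to exact changes in other variables. Noisy data makes this kind of analysis even harder.
It is most commonly applied in physics and engineering, where a few simple models can describe much of the behavior.
In general, the only random component of the data in such an analysis is measurement error.

http://www.fhwa.dot.gov/resourcecenter/teams/pavement/pave_3pdg.pdf
In the example above, we want to understand how differences and changes in pavement design directly lead to changes in how the pavement performs.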

What is Data? (5:15)
Data is a set of values of qualitative or quantitative variables.
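First you need a set of objects, the collection of items you are going to measure; in statistical inference this set is sometimes called the population. A variable is a measurement or characteristic of an object. It can be quantitative, such as a person's measured height or the time someone spends on a website, or qualitative, such as which parts of the site a visitor viewed or what you believe the visitor's gender to be.
  • Qualitative variables are things like country of origin, sex, or treatment; they are not necessarily ordered and not necessarily measured.
  • Quantitative variables are usually measured on a continuous scale, such as height, weight, and blood pressure; they are ordered along that scale.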

http://brianknaus.com/software/srtoolbox/s_4_1_sequence80.txt

https://dev.twitter.com/rest/reference/get/blocks/list

http://bluebuttontoolkit.healthit.gov/challenge/

How Many Computers to Identify a Cat? 16,000

Darwintunes
http://www.pnas.org/content/109/30/12081.full
https://soundcloud.com/uncoolbob/sets/darwintunes

http://www.data.gov/

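The data should match the question you are answering. Often the data will either limit or enable the questions you can ask: you may pose a question and find that you have no data that can answer it, so you must revise the question into a set of answerable sub-questions or related questions. In short, having data is not enough; without a question you cannot get anywhere.

Start with a question, then use the data at hand to find the answer, rather than starting with the data and then looking for a question.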

What About Big Data? (4:15)



http://mashable.com/2011/06/28/data-infographic/

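As time passes and technology advances, the notion of what counts as big data keeps changing.

So one way to deal with a big-data problem is simply to wait until hardware catches up with the growth of the data.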
Six Degrees of Separation
Travers and Milgram (1969) Sociometry

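Stanley Milgram ran an experiment in which he selected 296 participants, sent each of them a letter, and asked them to forward it to someone they knew, and so on down the chain, until the letter reached a specific target. Sixty-four of these chains reached the target; that is, 64 of the 296 letters arrived. The researchers found that about 5.2 intermediaries separated the person who started a letter from the person who finally received it.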


Leskovec and Horvitz WWW '08


Don't use Hadoop - your data isn't that big
“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”
──John Tukey
“... no matter how big the data are.”
──Leek
Experimental Design (15:59)

http://www.nature.com/nm/journal/v12/n11/full/nm1491.html

http://arxiv.org/pdf/1010.1092.pdf

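In any experimental design or data science project, the first thing to do is to care about the analysis plan. In designing and analyzing a study you need to pay attention to every stage, from data cleaning to data analysis to reporting, so that you do not end up in an embarrassing situation. More importantly, you need to be aware of the key points in the study design where you can go wrong.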

https://nsaunders.wordpress.com/2012/07/23/we-really-dont-care-what-statistical-method-you-used/

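Whatever study you carry out, you need a plan for sharing your data and code.
Before you actually run an experiment, the first thing to do is formulate your question in advance.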

http://www.wired.com/2012/04/ff_abtesting

http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture2.pdf

Chocolate Consumption, Cognitive Function, and Nobel Laureates
http://www.nejm.org/doi/full/10.1056/NEJMon1211064

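The relationship in the example above is sometimes called a spurious correlation.

There are several ways to deal with confounders like these:
  • Fix some of the variables: if everyone knows a variable was held fixed, it cannot be a confounder.
  • Stratify on the variable.
  • If neither of those is possible, randomize.
Randomization means using a computer program or a coin flip to assign the subjects to the different groups.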
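The "computer program" form of a coin flip can be sketched in a few lines of shell. This is only an illustration: the subject IDs `s1` to `s6`, the group names, and the seed are made up, and it assumes a POSIX `awk` is available (as it is in Git Bash and in Terminal).

```shell
# Assign six hypothetical subjects to "treatment" or "control" by a simulated coin flip.
# srand(2015) fixes the seed so that the same assignment can be reproduced later.
assignment="$(
  printf '%s\n' s1 s2 s3 s4 s5 s6 |
    awk 'BEGIN { srand(2015) } { print $1, (rand() < 0.5 ? "treatment" : "control") }'
)"
printf '%s\n' "$assignment"
```

A plain coin flip per subject does not guarantee equal group sizes; it only guarantees that the assignment is unrelated to any characteristic of the subjects, which is what breaks the link to a confounder.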
http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture1.pdf

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


http://xkcd.com/882/
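  • A good experimental design should:
    • Have replication, so that you can measure the variability in the data.
    • Compare the measured variability to the signal you care about.
    • Generalize to the problem you care about.
    • Be transparent about code and data.
  • Prediction is not inference; both are important, and which one you use depends on your situation.
  • In any data science problem, beware of data dredging.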

Wednesday, June 10, 2015

Coursera R Programming Week 2: Study Notes

R Programming
Johns Hopkins Bloomberg School of Public Health

Week 2 study notes

June 1, 2015 - June 29, 2015

R Programming: Week 2

Today marks the beginning of Week 2 of R Programming. This week we take the gloves off and the lectures cover key topics like control structures and functions. We also introduce the first programming assignment for the course, which is due at the end of the week.

A few notes about the Programming Assignment:

Each part of the Assignment can be submitted an infinite number of times---there is no limit on the number of submissions.
For each part, we take your maximum score over all of your submissions.
There is a submission script that you will have to download to submit your assignment.
Roger Peng and the Data Science Team
Mon 8 Jun 2015 8:01 AM CST

Week 2: Programming with R

This week is all about functions and about controlling the flow of an R program. We start with control structures (like if-else, and for loops) and then move on to writing functions. Next, we discuss the lexical scoping features of the language and how they can be used in interesting ways, particularly for statistical applications.

Learning Objectives

By the end of this week you should be able to:
  • Write an if-else expression
  • Write a for loop, a while loop, and a repeat loop
  • Define a function in R and specify its return value [see Functions Part 1 and Part 2]
  • Describe how R binds a value to a symbol via the search list
  • Define what lexical scoping is with respect to how the value of free variables are resolved in R
  • Describe the difference between lexical scoping and dynamic scoping rules
  • Convert a character string representing a date/time into an R datetime object. [see Dates and Times]

Programming

There is a graded programming assignment for this week.
  • Programming assignment 1: Air Pollution
For those interested in a bit of a "warm up" to this programming assignment, Derek Franks has written a very nice tutorial to help you get up to speed.


This programming assignment has multiple parts and you will submit your answers using the submit script described in the instructions.

Sunday, June 7, 2015

Coursera The Data Scientist’s Toolbox Week 2: Study Notes

The Data Scientist’s Toolbox
Johns Hopkins Bloomberg School of Public Health

Week 2 study notes

June 1, 2015 - June 29, 2015

Video Lectures
Week 2 (50:50)

Tips from Coursera Users - Optional Video (3:53)

Command Line Interface (16:04)

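Windows: Git Bash (unless you are familiar with Git, accept the default at every step of the installer. Git Bash is available only on Windows!)
Mac/Linux: Terminal

/ (root): the root directory
~ (home): the home directory

CLI Commands
  • command
  • flags (options)
  • arguments (what the command operates on)

Summary of Commands
  • pwd (print working directory): print the current working directory
  • clear: clear the terminal window
  • ls (list): list the files and folders in the working directory
  • ls -a: list all files and folders, hidden and unhidden
  • ls -al: list details for all files and folders, hidden and unhidden
  • cd "directory path" (change directory): change the working directory; with no argument it goes to the home directory
  • cd ..: go up one level to the parent directory
  • mkdir "directory name" (make directory): create a directory
  • touch "file name": create an empty file
  • cp "file name" "destination directory" (copy): copy a file
  • cp -r "directory name" "destination directory" (copy): copy a directory and its contents
  • rm "file name" (remove): delete a file
  • rm -r "directory name" (remove): delete a directory and its contents
  • mv "file name" "destination directory" (move): move a file
  • mv "file name" "new file name" (rename): rename a file
  • echo "value": print the arguments you give it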
$ echo Hello World!
Hello World!
  • date: print the current date and time
$ date
Sun Jun  7 13:00:00     2015
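The file commands in the summary above can be tried safely in a throwaway directory. The following is a minimal sketch, assuming a POSIX-like shell (Git Bash or Terminal) with `mktemp` available; the names `sandbox`, `demo`, `notes.txt`, `renamed.txt`, and `demo_copy` are made up for the example.

```shell
# Work inside a temporary directory so that nothing real is touched.
sandbox="$(mktemp -d)"
cd "$sandbox"

mkdir demo                  # create a directory
touch notes.txt             # create an empty file
cp notes.txt demo           # copy the file into demo/
mv notes.txt renamed.txt    # rename the original file
cp -r demo demo_copy        # copy the directory and its contents
rm demo/notes.txt           # delete the copied file
rm -r demo                  # delete the directory
ls                          # lists demo_copy and renamed.txt
```

Running `pwd` at any point shows that the working directory is the temporary folder rather than your home directory.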

Introduction to Git (4:49)

Version Control


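A version control system records the changes you make to a file or set of files over time, so that you can recall a specific earlier version.

Git is a free, open-source version control system, and one of the most popular and widely used today.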

https://git-scm.com/book/en/v2/Getting-Started-A-Short-History-of-Git

$ git config --global user.name "Your Name Here"
$ git config --global user.email "your_email@example.com"
You only need to do this once, but you can change these settings at any time.

$ git config --list  ## shows your user name, email, and other settings

$ exit  ## exit Git Bash


Introduction to Github (3:53)

Git = Local (on your computer); GitHub = Remote (on the web)

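GitHub is a web-based hosting service for software development projects, with the Git version control system at its core. It lets you work on projects online and publish them on the web for others to view and build on.

In other words, it lets users push and pull: a local repository under Git's control can be pushed to a remote repository on the web, or pulled back from it. It also gives every user a home page that lists all of that user's repositories. Repositories on GitHub are backed up on GitHub's servers as well, in case something happens to your local copy.

But the real core of GitHub is its social side: users can follow one another, share their projects, and contribute to each other's work.

Note: your GitHub account email should be the same as your Coursera email.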

Creating a Github Repository (5:51)

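Creating a repository (repo)
  • Create a brand-new repository
  1. Go to https://github.com/new
  2. Or create it from the upper-right corner of your profile page (https://github.com/yourUserNameHere).

Note: at the time of writing (2015), free accounts can create only public repositories.
 Remember to tick the "Initialize this repository with a README" checkbox.

You can now make a copy on your local computer: open Git Bash, then create a folder to hold the local copy of the repository.


  • Fork another user's repository
Forking lets you collaborate with others on software; it creates a copy of the repository under your own account.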


git clone http://github.com/yourUserNameHere/repoNameHere.git

This command fetches the version of the repository that lives on the remote server and copies it into your current working directory.
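To see where a clone lands without needing a network connection, the sketch below clones from a local bare repository that stands in for the GitHub URL. The `work` directory and the stand-in `repoNameHere.git` are assumptions for the demo; with a real repository you would pass the GitHub address instead.

```shell
work="$(mktemp -d)"
git init -q --bare "$work/repoNameHere.git"          # local stand-in for the repository on GitHub
cd "$work"                                           # the clone is created inside the current directory
git clone -q "$work/repoNameHere.git" 2>/dev/null    # Git warns that the repository is empty; that is fine here
ls -d repoNameHere/.git                              # the new folder is itself a Git repository
```

The folder name is taken from the last part of the address with the `.git` suffix removed, which is why the clone appears as `repoNameHere`.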

https://help.github.com/articles/fork-a-repo/
https://git-scm.com/book/it/v2/Git-Basics-Getting-a-Git-Repository

Basic Git Commands (5:52)

http://gitready.com/beginner/2009/01/21/pushing-and-pulling.html

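git add .: add all new files in the working directory to the staging area
git add -u: update tracking for files that were renamed or deleted
git add -A: do both of the above
git commit -m "message": commit the staged changes; the message should describe what changed. This is a local operation only and does not update GitHub
git push: push your local commits to GitHub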
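The add, commit, push cycle can be rehearsed entirely on one machine. In this sketch a local bare repository plays the role of GitHub; the paths under `work`, the file contents, and the commit message are made up, and the identity is set only inside the demo repository.

```shell
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"                       # stand-in for the repository on GitHub
git clone -q "$work/remote.git" "$work/repo" 2>/dev/null    # local working copy
cd "$work/repo"
git config user.name "Your Name Here"                       # repo-local identity for the demo
git config user.email "your_email@example.com"

echo "## My first repo" > README.md
git add -A                                       # stage new, changed, and deleted files
git commit -q -m "Add README"                    # recorded locally; the remote is still empty
git push -q origin HEAD                          # publish the current branch to the remote
git --git-dir="$work/remote.git" log --all --oneline   # the commit is now on the remote
```

Until the `git push` line runs, the commit exists only in the local repository, which is the distinction the notes above draw between committing and pushing.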

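Differences between a fork and a branch (http://wp.chunhsin.idv.tw/?p=4179)
  1. A fork makes a separate copy, and that copy is itself a complete repository.
  2. According to the official documentation, a fork is mainly for using someone else's repository as the starting point for your own work, or for contributing to someone else's repository. In other words, a fork is usually a copy of a repository owned by another GitHub account.
  3. For your own repository, the right approach is to use a branch.
  4. A fork can contain branches, but a branch cannot contain a fork.
  5. Both forks and branches can be merged into the main version. The only difference is that a fork sends a merge request to the original author and needs the author's approval before it can be merged, whereas a branch comes from your own account and needs no separate approval.
git checkout -b "branchname": create a branch and switch to it
git branch: list the branches
git checkout master: switch back to the master branch

Merging a fork or a branch through a pull request is a feature of GitHub, not of Git itself.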
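A short sketch of the three branch commands in a scratch repository. The branch name `feature` and the variable names are made up; the default branch may be called `master` or `main` depending on how Git is configured, so the script reads the name instead of assuming it.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Your Name Here"              # repo-local identity for the demo
git config user.email "your_email@example.com"
git commit -q --allow-empty -m "Initial commit"    # a branch needs at least one commit to point at
main_branch="$(git symbolic-ref --short HEAD)"     # "master" or "main", depending on configuration

git checkout -q -b feature       # create the branch and switch to it
git branch                       # lists both branches; * marks the current one
git checkout -q "$main_branch"   # switch back to the default branch
```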



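If you are merging into someone else's repository, they will be notified, and if they accept your changes they will merge your request into their repository.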

Basic Markdown (2:22)

Markdown (.md) is a plain-text file format with a simple, specific syntax. GitHub, R, and RStudio all recognize it.

Heading

## This is a secondary heading  // second-level heading
### This is a tertiary heading  // third-level heading

* first item in list  // first item of an unordered list
* second item in list // second item of an unordered list
* third item in list  // third item of an unordered list

Getting markdown help

Installing R Packages (5:37)

http://cran.r-project.org/mirrors.html

http://www.bioconductor.org/ (biology and large-scale data)

> a <- available.packages()
> head(rownames(a), 3)  ## Show the names of the first few packages
[1] "A3"          "abc"         "ABCanalysis"

http://cran.r-project.org/web/views/

> install.packages("slidify")  ## Installing an R Package

> install.packages(c("slidify", "ggplot2", "devtools"))

> source("http://bioconductor.org/biocLite.R")
> biocLite()
> biocLite(c("GenomicFeatures", "AnnotationDbi"))
library(): tells R which package to load

> library(ggplot2)
> search()  ## shows the search path; package:ggplot2 now appears right after .GlobalEnv
 [1] ".GlobalEnv"        "package:ggplot2"   "tools:rstudio"    
 [4] "package:stats"     "package:graphics"  "package:grDevices"
 [7] "package:utils"     "package:datasets"  "package:methods"  
[10] "Autoloads"         "package:base" 

Installing Rtools (2:29)

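This section applies to Windows users only.

Rtools is a set of tools required for building R packages on Windows.
> find.package("devtools")      ## check whether devtools is already installed
> install.packages("devtools")  ## install it if it is not

> library(devtools)

Then run find_rtools(); it should return TRUE.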

Thursday, June 4, 2015

Coursera R Programming Week 1: Study Notes

R Programming
Johns Hopkins Bloomberg School of Public Health

Week 1
Course introduction and Week 1 study notes

June 1, 2015 - June 29, 2015

Learn how to program in R and how to use R for effective data analysis. This is the second course in the Johns Hopkins Data Science Specialization.

Course categories
Information, Tech & Design
Statistics and Data Analysis

Instructors

Roger D. Peng, PhD - Johns Hopkins University
Jeff Leek, PhD - Johns Hopkins University
Brian Caffo, PhD - Johns Hopkins University

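Course at a glance
Length: 4 weeks, 7-9 hours/week
Language: English
Subtitles: English, Spanish & Chinese

Course description
In this course you will learn how to program in R and how to use R for effective data analysis. You will learn how to install and configure the software needed for a statistical programming environment, and the course describes generic programming language concepts as they are implemented in a high-level statistical language. The course covers practical issues in statistical computing, including programming in R, reading data into R, accessing R packages, writing R functions, debugging, profiling R code, and organizing and commenting R code. Topics in statistical data analysis will provide working examples.

Course syllabus
  • Week 1: Overview of R, R data types and objects, reading and writing data
  • Week 2: Control structures, functions, scoping rules, dates and times
  • Week 3: Loop functions, debugging tools
  • Week 4: Simulation, code profiling

Recommended background
Some familiarity with programming concepts and basic statistical reasoning will be helpful, as will the course The Data Scientist’s Toolbox.

Suggested readings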
The e-book R Programming for Data Science covers all of the material presented in this course. It is available for download from Leanpub.

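Course format
Each week there will be videos, quizzes, and programming assignments.

As part of this course you will need to set up a GitHub account. GitHub is a tool for collaborative code sharing and editing. During this course and the other courses in the Specialization, you will submit links to files that you have placed publicly in your GitHub account as part of the peer assessment assignments. If you are concerned about your identity being known, you should set up an anonymous GitHub account, and take care not to include any information that you do not want your peer evaluators to see.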

Data Science Specialization Community Site

Since the beginning of the Data Science Specialization, we've noticed the unbelievable passion students have about our courses and the generosity they show toward each other on the course forums. A couple students have created quality content around the subjects we discuss, and many of these materials are so good we feel that they should be shared with all of our students.

We're excited to announce that we've created a site using GitHub Pages: http://datasciencespecialization.github.io/ to serve as a directory for content that the community has created. If you've created materials relating to any of the courses in the Data Science Specialization, please send us a pull request so we can add a link to your content on our site. You can find out more about contributing here: https://github.com/DataScienceSpecialization/DataScienceSpecialization.github.io#contributing

We can't wait to see what you've created and where the community can take this site!
- The JHU Data Science Lab Team
Thu 4 Jun 2015 4:30 AM CST

R Programming: Week 1

As you browse the course web site, please make sure to read through the syllabus which contains important information about the grading policy for quizzes and programming assignments as well as the course schedule. 

Please pay particular attention to the differences among the various Programming Assignments. Whereas Programming Assignments 1 and 3 are graded via unit tests that use a submission script that will compare the output of your functions to the correct output, Programming Assignment 2 requires that you submit R code for evaluation and grading by your classmates.  

This week will cover the basics to get you started up with R. There are videos demonstrating how to install R on Windows and Mac. The Week 1 videos will cover the history of R and S, go over the basic data types in R, and describe the functions for reading and writing data. I recommend that you watch the videos in the order that they are listed on the web page, but watching the videos out of order isn't going to ruin the story. For each lecture video you can download a separate PDF document of the slides (the demonstration videos don't have slides associated with them).
Roger Peng and the Data Science Team
Mon 1 Jun 2015 9:01 PM CST

R Programming: Welcome to swirl

In this course, you have the option to use the swirl R package to practice some of the concepts we cover in lectures. swirl teaches you R programming and data science interactively, at your own pace, and right in the R console.

Each lesson that you complete in swirl is worth one extra credit point. However, the maximum number of points you may earn for the assignment is capped at 5. While these lessons will give you valuable practice and you are encouraged to complete as many as possible, please note that they are completely optional and you can get full marks in the class without completing them.

You can find the instructions for how to install and use swirl in the Programming Assignments section of the course under Week 1. Have fun!
Roger Peng and the Data Science Team
Mon 1 Jun 2015 9:01 PM CST

R Programming: Pre-Course Survey

Thanks for signing up for R Programming. As you probably know, this course is part of the Data Science Specialization, a sequence of nine massive open online courses (MOOCs) plus a Capstone project. We would like to learn more about your motives for taking this course and your intentions for both this course and the overall specialization. To help us out, please complete our short pre-course survey. It should only take about 3 minutes of your time.

Thanks,
The Data Science Team
Mon 1 Jun 2015 9:01 PM CST

Assessments
Quizzes

There will be one quiz every week. The quizzes will all open on the first day of the course but they will be due weekly. So the Week 1 Quiz will be due at the end of the first week and the Week 2 Quiz will be due at the end of the second week, etc.

Please refer to the individual weekly Quiz deadlines to see the exact date and time that each Quiz is due.

Programming Assignments

There will be three required programming assignments. The first programming assignment is due at the end of the second week. Subsequent programming assignments are due weekly after that.

Programming Assignments 1 and 3 will be graded via unit tests using a submission script that will compare the output of your functions to the correct output. To access Programming Assignments 1 and 3, click the corresponding link in the left navigation bar.

Programming Assignment 2 will be submitted differently and graded via a peer assessment. To access Programming Assignment 2, click the corresponding link in the left navigation bar.

swirl Programming Assignment (optional)

In this course, you have the option to use the swirl R package to practice some of the concepts we cover in lectures.

Each lesson that you complete in swirl is worth one extra credit point. However, the maximum number of points you may earn for the assignment is capped at 5. While these lessons will give you valuable practice and you are encouraged to complete as many as possible, please note that they are completely optional and you can get full marks in the class without completing them.

You can find the instructions for how to install and use swirl in the Programming Assignments section of the course under Week 1.

Grading
Quizzes

You may attempt each quiz up to 3 times. The score of your most successful attempt will count toward your grade.

Programming Assignments

Programming assignments 1 and 3 will require submissions via a submission script. You may make an unlimited number of submissions for each of programming assignments 1 and 3, and your most successful submission will count toward your grade. The swirl Programming Assignment is completely optional and extra credit.

Hard deadlines and soft deadlines for Quizzes 1-3 and Programming Assignment 1

The reported due date is the soft deadline for quizzes 1-3 and programming assignment 1. You may turn in quizzes 1-3 and programming assignment 1 up to five days after the soft deadline. The hard deadline is five days after the Quiz is due at 23:30 UTC. If you submit after the due date (but before the hard deadline), your submission score will be penalized by 10% for each day after the due date. If you use a late day, the 10% per day penalty will not be applied for that day.
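The late-penalty rule can be written out as arithmetic. The sketch below is one reading of the policy, not the course's actual grading code: it assumes the 10% per day is taken off the earned score without compounding, that each late day used cancels one day of penalty, and that anything more than five days late earns nothing. The helper `late_score` and its whole-point rounding are hypothetical.

```shell
# late_score SCORE DAYS_LATE LATE_DAYS_USED   (hypothetical helper; integer points only)
late_score() {
  score=$1; days_late=$2; late_days_used=$3
  if [ "$days_late" -gt 5 ]; then        # past the hard deadline: no credit
    echo 0
    return
  fi
  penalized=$(( days_late - late_days_used ))
  if [ "$penalized" -lt 0 ]; then penalized=0; fi
  echo $(( score * (100 - 10 * penalized) / 100 ))
}

late_score 20 0 0   # on time: 20
late_score 20 2 0   # two days late, no late days used: 16
late_score 20 2 2   # two days late, both covered by late days: 20
```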

Please note: There is no partial credit grace period for Quiz 4 or Programming Assignments 2 and 3. Those deadlines are firm, and work submitted after the hard deadline will not receive credit.

Late Days for Quizzes and Programming Assignment 1

You are permitted a total of 5 late days for quizzes and assignments in the course. If you use a late day, your quiz or assignment grade will not be affected if it is submitted late.

No Late Days for Programming Assignment 2

Peer assessment deadlines have to be synchronous. Therefore, Late Days cannot be applied to Programming Assignment 2. Only one deadline can be set for students to submit and peer-grade each other's work. This is necessary in order to maintain a synchronized peer grading process.

Points and scoring

There are 100 points available in the course. The breakdown of points is as follows:
  • Week 1 Quiz - 20 points
  • Week 2 Quiz - 10 points
  • Week 3 Quiz - 5 points
  • Week 4 Quiz - 10 points
  • Programming Assignment 1 (Air Pollution) - 20 points
  • Programming Assignment 2 (Lexical Scoping) - 10 points
  • Programming Assignment 3 (Hospital Quality) - 25 points
  • swirl Programming Assignment - Maximum of 5 extra credit points
You must earn 70 points to pass the course and earn a certificate. Students who earn 90 points and above will receive a certificate with Distinction.
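As a quick check, the required components listed above do add up to the stated 100 points. The small `grade` helper below is hypothetical and simply encodes the two thresholds from the text.

```shell
# Required components only; the swirl extra credit is not included.
total=$((20 + 10 + 5 + 10 + 20 + 10 + 25))
echo "total = $total"          # total = 100

# grade POINTS   (hypothetical helper applying the 70 and 90 point thresholds)
grade() {
  if [ "$1" -ge 90 ]; then
    echo "pass with distinction"
  elif [ "$1" -ge 70 ]; then
    echo "pass"
  else
    echo "no certificate"
  fi
}

grade 84    # pass
```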

Wednesday, June 3, 2015

Coursera The Data Scientist’s Toolbox Week 1: Study Notes

The Data Scientist’s Toolbox
Johns Hopkins Bloomberg School of Public Health

Week 1
Course introduction and Week 1 study notes

June 1, 2015 - June 29, 2015

Get an overview of the data, questions, and tools that data analysts and data scientists work with. This is the first course in the Johns Hopkins Data Science Specialization.

Course categories
Statistics and Data Analysis

Instructors
Jeff Leek, PhD - Johns Hopkins University
Roger D. Peng, PhD - Johns Hopkins University
Brian Caffo, PhD - Johns Hopkins University

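Course at a glance
Length: 4 weeks, 1-4 hours/week
Language: English
Subtitles: Portuguese, English, Greek, Chinese & Russian

Course description
This course gives a brief introduction to the main tools and ideas in the data scientist's toolbox. It provides an overview of the data, questions, and tools that data analysts and data scientists work with. The course has two parts. The first introduces the concepts behind turning data into actionable knowledge. The second is a practical introduction to the tools used in the program, such as version control, markdown, git, GitHub, R, and RStudio.

Course syllabus
Upon completing this course you will be able to identify and classify data science problems. You will also have created your GitHub account, created your first repository, and pushed your first markdown file to your account.

Recommended background
There are no prerequisites. Prior programming experience will be very useful.

Course format
This course consists of weekly lecture videos, weekly quizzes, and a final peer-assessed project.

FAQ
How do the courses in the Data Science Specialization depend on each other?
We have created a handy course dependency chart to help you see how the nine courses in the specialization depend on each other.

Will I get a Statement of Accomplishment after completing this course?
Yes. Students who successfully complete the course will receive a Statement of Accomplishment signed by the instructor.

What resources will I need for this course?
For this course you only need an internet connection and access to GitHub.

Where does this course fit in the Data Science Specialization?
It is the first course in the sequence. We recommend taking it before R Programming or any of the other courses in the specialization.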