Course 02402/02323 Introduction to Statistics
Lecture 1: Introduction to Statistics

Per Bruun Brockhoff
DTU Compute, Danish Technical University, 2800 Lyngby, Denmark
e-mail: [email protected]
Introduction to Statistics, Lecture 1, Fall 2015

Agenda

1 Practical course information
2 Introduction to Statistics - a primer
3 Intro Case stories: IBM Big data, Novo Nordisk small data, Skive fjord
4 Introduction to Statistics
5 Descriptive Statistics: Summary Statistics
    Mean
    Median
    Variance and standard deviation
    Quantiles
    Covariance and correlation
    Plots/figures
6 Software: R

Practical course information

Practical Information

Teaching module: Tuesdays 8-12 (02323: Fridays 8.00-12.00)

Homepage: introstat.compute.dtu.dk
    Online eNote
    Syllabus, Lecture plan
    Exercises & solutions
    Slides
    Podcasts of lectures (In English AND Danish)
    Quizzes

Campusnet: www.campusnet.dtu.dk
    Messages and (certain) file sharings
    Links to interesting stories
    Projects - description AND submission

Generic weekly agenda:
    BEFORE teaching module: Read announced stuff
    2 hours long lectures (curriculum of the week)
    2 hours of exercises (Mix of: eNote and online quiz-questions)
    AFTER teaching module: Test yourself by online exam quiz.

Exam: 4 hour multiple choice, Sunday 13/12

MANDATORY projects: 2 must be approved to be able to go to the exam.
    Each project will have 3 optional versions!

Introduction to Statistics - a primer

Millennium list

New England Journal of Medicine: EDITORIAL: Looking Back on the Millennium in Medicine, N Engl J Med, 342:42-49, January 6, 2000.
http://www.nejm.org/doi/full/10.1056/NEJM200001063420108

    Elucidation of Human Anatomy and Physiology
    Discovery of Cells and Their Substructures
    Elucidation of the Chemistry of Life
    Application of Statistics to Medicine
    Development of Anesthesia
    Discovery of the Relation of Microbes to Disease
    Elucidation of Inheritance and Genetics
    Knowledge of the Immune System
    Development of Body Imaging
    Discovery of Antimicrobial Agents
    Development of Molecular Pharmacotherapy

James Lind

"One of the earliest clinical trials took place in 1747, when James Lind treated 12 scorbutic ship passengers with cider, an elixir of vitriol, vinegar, sea water, oranges and lemons, or an electuary recommended by the ship’s surgeon. The success of the citrus-containing treatment eventually led the British Admiralty to mandate the provision of lime juice to all sailors, thereby eliminating scurvy from the navy."
(See also http://en.wikipedia.org/wiki/James_Lind).

John Snow

"The origin of modern epidemiology is often traced to 1854, when John Snow demonstrated the transmission of cholera from contaminated water by analyzing disease rates among citizens served by the Broad Street Pump in London’s Golden Square. He arrested the further spread of the disease by removing the pump handle from the polluted well."
(See also http://en.wikipedia.org/wiki/John_Snow_(physician)).

Google - Big Data, IBM - Big Data

A quote from the New York Times, 5 August 2009, from the article titled "For Today’s Graduate, Just One Word: Statistics":

"The key is to let computers do what they are good at, which is trawling these massive data sets for something that is mathematically odd," said Daniel Gruhl, an I.B.M. researcher whose recent work includes mining medical data to improve treatment. "And that makes it easier for humans to do what they are good at - explain those anomalies."

"I keep saying that the sexy job in the next 10 years will be statisticians," said Hal Varian, chief economist at Google. "And I’m not kidding."

(And Politiken, 12/2 2014 - see links in CampusNet)

Intro Case stories: IBM Big data, Novo Nordisk small data, Skive fjord

    Presentation by Senior Scientist Hanne Refsgaard, Novo Nordisk A/S
    IBM Social Media podcast by Henrik H. Eliassen, IBM.
    Skive Fjord podcasts, by Jan K. Møller, DTU.

Introduction to Statistics

How to treat (or analyse) data?
What is random variation?
Statistics is a tool for making decisions:
    How many computers did we sell last year?
    What is the expected price of a share?
    Is machine A more effective than machine B?
Statistics can be used in most disciplines and is therefore a very important tool.

Statistics and Engineers

Statistics is an important tool in problem solving:
    Data analysis
    Quality improvement
    Design of experiments
    Predictions of future values
    .. and much more!

Modern statistics

Modern statistics is based on probability theory and descriptive statistics.

Statistics

Statistics is often about analyzing a sample that is taken from a population.
Based on the sample, we try to generalize (or comment on) the population.
Therefore it is important that the sample is representative of the population.

[Figure: a sample is selected at random from the population; statistical inference goes from the sample back to the population.]

Descriptive Statistics: Summary Statistics

Summary statistics

We use a number of summary statistics to summarize and describe data (stochastic variables):
    Mean \bar{x}
    Median
    Variance s^2
    Standard deviation s
    Percentiles

Mean

The mean value is a key number that indicates the centre of gravity or centering of the data.
The mean:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

We say that \bar{x} is an estimate of the mean value.
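
As a small illustration of the mean formula, the following R sketch computes the mean directly from the definition and compares it with the built-in `mean()`. The five values are the student heights used in the lecture's example; the variable names `n` and `xbar` are only illustrative choices.

```r
# Five student heights in cm (illustrative sample)
x <- c(185, 184, 194, 180, 182)
n <- length(x)

# Mean from the definition: sum of the observations divided by n
xbar <- sum(x) / n
xbar

# The built-in function gives the same value
mean(x)
```

Computing the value once by hand and once with the built-in function is a simple way to check that a formula has been read correctly.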

Median

The median is also a key number, indicating the center of the data. In some cases, for example in the case of extreme values, the median is preferable to the mean.
Median: The observation in the middle (in sorted order)

Example: Student heights

Sample: x <- c(185, 184, 194, 180, 182), n = 5

Mean:

\[ \bar{x} = \frac{1}{5}(185 + 184 + 194 + 180 + 182) = 185 \]

Median, first order the data: 180 182 184 185 194
And then choose the middle (3rd) number, since n is odd: 184

What if a person of 235 cm is added to the data?
    Median = 184.5
    Mean = 193.33

Variance and standard deviation

The variance (or the standard deviation) indicates the spread of the data.

Variance:

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Standard deviation:

\[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Example: Student heights

Data, n=5: 185 184 194 180 182

Variance:

\[ s^2 = \frac{1}{4}\left((185 - 185)^2 + (184 - 185)^2 + (194 - 185)^2 + (180 - 185)^2 + (182 - 185)^2\right) = 29 \]

Standard deviation:

\[ s = \sqrt{s^2} = \sqrt{29} = 5.385 \]

The coefficient of variation

The standard deviation and the variance are key numbers for absolute variation. If it is of interest to compare variation between different data sets, it might be a good idea to use a relative key number, the coefficient of variation:

\[ V = \frac{s}{\bar{x}} \cdot 100 \]

Percentiles = quantiles

The median is the point that divides the data into two halves. It is of course possible to find other points that divide the data into other parts; they are called percentiles.
Often calculated percentiles are the 0, 25, 50, 75, 100 % percentiles (quartiles) and/or the 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 % percentiles.
Note: the 50% percentile is the median.

Quantiles, Definition 1.6

The p'th quantile, also named the 100p'th percentile, can be defined by the following procedure:

1 Order the n observations from smallest to largest: x_{(1)}, ..., x_{(n)}.
2 Compute pn.
3 If pn is an integer: Average the pn'th and (pn + 1)'th ordered observations:

\[ \text{The } p\text{'th quantile} = \left( x_{(np)} + x_{(np+1)} \right) / 2 \qquad (1) \]

4 If pn is a non-integer, take the "next one" in the ordered list:

\[ \text{The } p\text{'th quantile} = x_{(\lceil np \rceil)} \qquad (2) \]

where \lceil np \rceil is the ceiling of np, that is, the smallest integer larger than np.

Example: Student heights

Data, n=5: 185 184 194 180 182

Lower quartile, Q1, first order the data: 180 182 184 185 194
Then choose the right observation based on np = 1.25: Q1 = 182

Upper quartile, Q3, first order the data: 180 182 184 185 194
Then choose the right observation based on np = 3.75: Q3 = 185
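
The variance, coefficient of variation and quantile definitions can be checked numerically on the student heights. The R sketch below is only an illustration: the helper name `quantile_def16` is hypothetical (it does not come from the eNote), and the handling of p = 0 and p = 1 by clamping the indices is an assumption, since the procedure in Definition 1.6 does not spell out those edge cases. R's built-in `quantile()` with `type = 2` is used for comparison.

```r
# Student heights from the example (cm)
x <- c(185, 184, 194, 180, 182)
n <- length(x)

# Sample variance and standard deviation from the definitions
s2 <- sum((x - mean(x))^2) / (n - 1)
s  <- sqrt(s2)

# Coefficient of variation in percent
V <- s / mean(x) * 100

# Hypothetical helper following the procedure of Definition 1.6
quantile_def16 <- function(x, p) {
  xs <- sort(x)                     # step 1: order the observations
  np <- length(xs) * p              # step 2: compute pn
  if (abs(np - round(np)) < 1e-9) { # step 3: pn is an integer
    k  <- round(np)
    lo <- max(k, 1)                 # assumption: clamp indices for p = 0
    hi <- min(k + 1, length(xs))    # assumption: clamp indices for p = 1
    (xs[lo] + xs[hi]) / 2
  } else {
    xs[ceiling(np)]                 # step 4: take the "next one"
  }
}

s2; s; V
quantile_def16(x, 0.25)   # lower quartile Q1
quantile_def16(x, 0.75)   # upper quartile Q3

# Built-in comparison
quantile(x, probs = c(0.25, 0.5, 0.75), type = 2)
```

The integer check uses a small tolerance rather than exact equality because products such as n * p are computed in floating point and may differ from a whole number by a tiny rounding error.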

Covariance and correlation

Covariance and correlation - measuring relation

Student         1      2      3      4      5      6      7      8      9      10
Heights (x_i)   168    161    167    179    184    166    198    187    191    179
Weights (y_i)   65.5   58.3   68.1   85.7   80.5   63.4   102.6  91.4   86.7   78.9

[Figure: scatter plot of Weight against Height for the 10 students, points numbered 1 to 10, with the means \bar{x} = 178 and \bar{y} = 78.1 marked.]

Covariance and correlation - Def. 1.17 and 1.18

The sample covariance is given by

\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \qquad (3) \]

The sample correlation coefficient is given by

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{s_{xy}}{s_x \cdot s_y} \qquad (4) \]

where s_x and s_y are the sample standard deviations for x and y respectively.

Covariance and correlation - measuring relation

Student   Heights (x_i)   Weights (y_i)   (x_i - \bar{x})   (y_i - \bar{y})   (x_i - \bar{x})(y_i - \bar{y})
1         168             65.5            -10               -12.6             126.1
2         161             58.3            -17               -19.8             336.8
3         167             68.1            -11               -10               110.1
4         179             85.7            1                 7.6               7.6
5         184             80.5            6                 2.4               14.3
6         166             63.4            -12               -14.7             176.5
7         198             102.6           20                24.5              489.8
8         187             91.4            9                 13.3              119.6
9         191             86.7            13                8.6               111.7
10        179             78.9            1                 0.8               0.8

\[ s_{xy} = \frac{1}{9}(126.1 + 336.8 + 110.1 + 7.6 + 14.3 + 176.5 + 489.8 + 119.6 + 111.7 + 0.8) = \frac{1}{9} \cdot 1493.3 = 165.9 \]

s_x = 12.21 and s_y = 14.07, so

\[ r = \frac{165.9}{12.21 \cdot 14.07} = 0.97 \]

Correlation - properties

    r is always between -1 and 1: -1 ≤ r ≤ 1
    r measures the degree of linear relation between x and y
    r = ±1 if and only if all points in the scatterplot are exactly on a line
    r > 0 if and only if the general trend in the scatterplot is positive
    r < 0 if and only if the general trend in the scatterplot is negative

Correlation

[Figure: four scatter plots of y against x, with r ≈ 0.95, r ≈ -0.5, r ≈ 0 and r ≈ 0.]

Plots/figures

Figures/Tables

Quantitative data:
    Scatter plot (xy plot)
    Histogram
    Cumulative distribution
    Boxplots

Count data:
    Bar charts
    Pie charts
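
The height and weight example can also be reproduced in R. The sketch below computes the sample covariance and correlation from Definitions 1.17 and 1.18, compares them with the built-in `cov()` and `cor()`, and draws the corresponding scatter plot. The variable names `sxy` and `r` and the dashed lines marking the two means are illustrative choices, not taken from the eNote.

```r
# Heights (cm) and weights (kg) of the 10 students
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
y <- c(65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9)
n <- length(x)

# Sample covariance from the definition
sxy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Sample correlation as s_xy / (s_x * s_y)
r <- sxy / (sd(x) * sd(y))

sxy; r

# Built-in functions for comparison
cov(x, y); cor(x, y)

# Scatter plot (xy plot) with the two means marked by dashed lines
plot(x, y, xlab = "Height", ylab = "Weight")
abline(v = mean(x), h = mean(y), lty = 2)
```

Points in the upper right and lower left quadrants formed by the dashed lines contribute positive terms to the covariance sum, which is why a clear upward trend gives r close to 1 here.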

Software: R

    Install R and Rstudio
    Intro to basic computing
    Introduced in the eNote
    We use it in an integrated way throughout the course and material
    Globally, a rapidly growing open source computing environment
    WARNING: R CANNOT substitute for our brains! (Read section 1.5.4!)

    > ## Adding numbers in the console
    > 2+3
    [1] 5
    > y <- 3
    > x <- c(1, 4, 6, 2)
    > x
    [1] 1 4 6 2
    > x <- 1:10
    > x
    [1] 1 2 3 4 5 6 7 8 9 10

Software: R

    ## Sample Mean and Median (data from eNote)
    x <- c(168,161,167,179,184,166,198,187,191,179)
    mean(x)
    [1] 178
    median(x)
    [1] 179
    ## Sample variance and standard deviation
    var(x)
    [1] 149.11
    sd(x)
    [1] 12.211

    ## Sample quartiles
    quantile(x,type=2)
    ##   0%  25%  50%  75% 100%
    ##  161  167  179  187  198
    ## Sample quantiles 0%, 10%,..,90%, 100%:
    quantile(x,probs=seq(0, 1, by=0.10),type=2)
    ##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
    ## 161.0 163.5 166.5 168.0 173.5 179.0 184.0 187.0 189.0 194.5 198.0

Software: R

    ## A histogram of the heights:
    hist(x)

[Figure: "Histogram of x", Frequency against x.]

    ## A density histogram of the heights:
    hist(x,freq=FALSE,col="red",nclass=8)

[Figure: "Histogram of x", Density against x.]

Software: R

    plot(ecdf(x),verticals=TRUE)

[Figure: "ecdf(x)", Fn(x) against x.]

    ## A basic boxplot of the heights: (range=0 makes it "basic")
    boxplot(x,range=0,col="red",main="Basic boxplot")
    text(1.3,quantile(x),c("Minimum","Q1","Median","Q3","Maximum"),col="blue")

[Figure: "Basic boxplot" with Minimum, Q1, Median, Q3 and Maximum labelled.]

Software: R

    ## A modified boxplot of the heights with an
    ## extreme observation, 235cm added:
    ## The modified version is the default
    boxplot(c(x,235),col="red",main="Modified boxplot")
    text(1.3,quantile(c(x,235)),c("Minimum","Q1","Median","Q3","Maximum"),col="blue")

[Figure: "Modified boxplot" with Minimum, Q1, Median, Q3 and Maximum labelled; the extreme observation is drawn as a separate point.]

Next week

Next week: Probability, part 1 - eNote chapter 2.