class: center, middle # Regresión múltiple ### Análisis estadístico utilizando R <img src="https://miro.medium.com/max/1400/1*guak1sQTh5sAf46NMzbQig.jpeg" width="30%" /> UNQ UNTreF CONICET Ignacio Spiousas [<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#A42339;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg>](https://github.com/spiousas) [<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#A42339;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg>](https://twitter.com/Spiousas) Pablo Etchemendy [<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:black;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg>](https://github.com/https://github.com/petcheme) [<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#black;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg>](https://twitter.com/petcheme) 2021-08-20 --- class: left, top, highlight-last-item # Un ejemplo Volvamos a los pingüinos... ```r penguins_adelie <- penguins %>% drop_na() %>% filter(species == "Adelie") ``` <img src="linear_models_3_files/figure-html/unnamed-chunk-2-1.png" width="80%" style="display: block; margin: auto;" /> --- class: left, top, highlight-last-item # Un ejemplo Podemos aprovechar que ambas variables tienen una relación lineal con el peso y cambiar nuestro modelo por: `$$\hat{Peso} = b_0 + b_1 LargoPico + b_2 Aleta$$` .pull-left[ Que en **R** lo ajustamos de la siguiente forma: ```r m2<- penguins_adelie %>% lm(body_mass_g ~ bill_length_mm + flipper_length_mm, .) m2 ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm, ## data = .) ## ## Coefficients: ## (Intercept) bill_length_mm flipper_length_mm ## -3492.00 75.48 22.45 ``` El ajuste de cuadrados mínimos nos da: `$$\hat{Peso} = -3492.00 + 75.48 LargoPico + 22.45 Aleta$$` ] -- .pull-left[ ¿Es un *mejor* ajuste que con el largo de pico solo? ```r m1<- penguins_adelie %>% lm(body_mass_g ~ bill_length_mm, .) m1 ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ bill_length_mm, data = .) ## ## Coefficients: ## (Intercept) bill_length_mm ## 66.45 93.75 ``` El ajuste de cuadrados mínimos nos da: `$$\hat{Peso} = -66.45 + 93.75 LargoPico$$` ] --- class: left, top, highlight-last-item # ¿Cuán bueno es el ajuste? Veamos el `\(R^2\)` de cada modelo .pull-left[ `$$\hat{Peso} = -3492.00 + 75.48 LargoPico + 22.45 LargoAleta$$` ```r summary(m2) ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm, ## data = .) ## ## Residuals: ## Min 1Q Median 3Q Max ## -776.47 -231.80 -42.06 247.72 1007.17 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -3491.995 889.597 -3.925 0.000134 *** ## bill_length_mm 75.477 11.958 6.312 3.25e-09 *** ## flipper_length_mm 22.450 4.882 4.599 9.28e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 361.6 on 143 degrees of freedom ## Multiple R-squared: 0.3869, Adjusted R-squared: 0.3783 ## F-statistic: 45.12 on 2 and 143 DF, p-value: 6.43e-16 ``` ] .pull-left[ `$$\hat{Peso} = -66.45 + 93.75 LargoPico$$` ```r summary(m1) ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ bill_length_mm, data = .) ## ## Residuals: ## Min 1Q Median 3Q Max ## -785.17 -259.39 -7.04 250.77 1089.83 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 66.45 468.59 0.142 0.887 ## bill_length_mm 93.75 12.04 7.786 1.24e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 386.1 on 144 degrees of freedom ## Multiple R-squared: 0.2962, Adjusted R-squared: 0.2913 ## F-statistic: 60.61 on 1 and 144 DF, p-value: 1.242e-12 ``` ] --- class: left, top, highlight-last-item # ¿Cuán bueno es el ajuste? Recordemos la deficinión de `\(R^2\)` `$$R^2 = \frac{sd^2 - sd^2_{residuos}}{sd^2}$$` `$$R^2 = 1 - \frac{sd^2_{residuos}}{sd^2}$$` Es esperable que, al agergar predictores, el `\(R^2\)` sea igual o mayor Para tener en cuenta el agregado de parámetros, existe el `\(R^2\)` ajustado `$$R_{adj}^2 = 1 - \frac{sd^2_{residuos} / (n-k-1)}{sd^2 / (n-1)}$$` donde: - **n** es el número de observaciones - **k** es el número de predictores `$$R_{adj}^2 = 1 - \frac{sd^2_{residuos}}{sd^2} \times \frac{(n-1)}{(n-k-1)}$$` --- class: left, top, highlight-last-item # ¿Cuán bueno es el ajuste? Volvamos a los modelos y miremos el `\(R^2\)` ajustado .pull-left[ `$$\hat{Peso} = -3492.00 + 75.48 LargoPico + 22.45 LargoAleta$$` ```r summary(m2) ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm, ## data = .) ## ## Residuals: ## Min 1Q Median 3Q Max ## -776.47 -231.80 -42.06 247.72 1007.17 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -3491.995 889.597 -3.925 0.000134 *** ## bill_length_mm 75.477 11.958 6.312 3.25e-09 *** ## flipper_length_mm 22.450 4.882 4.599 9.28e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 361.6 on 143 degrees of freedom ## Multiple R-squared: 0.3869, Adjusted R-squared: 0.3783 ## F-statistic: 45.12 on 2 and 143 DF, p-value: 6.43e-16 ``` ] .pull-left[ `$$\hat{Peso} = -66.45 + 93.75 LargoPico$$` ```r summary(m1) ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ bill_length_mm, data = .) ## ## Residuals: ## Min 1Q Median 3Q Max ## -785.17 -259.39 -7.04 250.77 1089.83 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 66.45 468.59 0.142 0.887 ## bill_length_mm 93.75 12.04 7.786 1.24e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 386.1 on 144 degrees of freedom ## Multiple R-squared: 0.2962, Adjusted R-squared: 0.2913 ## F-statistic: 60.61 on 1 and 144 DF, p-value: 1.242e-12 ``` ] --- class: left, top, highlight-last-item # Variables categóricas Miremos ahora unos datos ligeramente diferentes: ```r penguins_adelie_gentoo <- penguins %>% drop_na() %>% filter(species %in% c("Adelie", "Gentoo")) ``` .pull-left[ Miremos la relación de largo de pico y peso: <img src="linear_models_3_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" /> Pareciera haber dos pendientes ¿No? ] -- .pull-right[ <img src="linear_models_3_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" /> Agreguemos el predictor categórico **species** al modelo ] --- class: left, top, highlight-last-item # Variables categóricas En este caso la exuación va a ser: `$$\hat{Peso} = b_0 + b_1 AnchoPico + b_2 Especie$$` Donde especie va a valer 0 para una especie y 1 para otra. .pull-left[ Miremos la relación de largo de pico y peso: ```r m2<- penguins_adelie_gentoo%>% lm(body_mass_g ~ bill_depth_mm + species, .) summary(m2) ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ bill_depth_mm + species, data = .) ## ## Residuals: ## Min 1Q Median 3Q Max ## -849.44 -259.45 -31.32 255.49 1195.69 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1249.97 375.83 -3.326 0.00101 ** ## bill_depth_mm 270.13 20.42 13.231 < 2e-16 *** ## speciesGentoo 2291.37 82.34 27.829 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 371 on 262 degrees of freedom ## Multiple R-squared: 0.8062, Adjusted R-squared: 0.8048 ## F-statistic: 545.1 on 2 and 262 DF, p-value: < 2.2e-16 ``` ] .pull-right[ Y si lo comparamos con el modelos sin **species** ```r m1 <- penguins_adelie_gentoo%>% lm(body_mass_g ~ bill_depth_mm, .) summary(m1) ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ bill_depth_mm, data = .) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1578.62 -539.24 -27.63 520.42 1753.09 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7729.34 382.60 20.202 <2e-16 *** ## bill_depth_mm -201.91 22.56 -8.951 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 736.6 on 263 degrees of freedom ## Multiple R-squared: 0.2335, Adjusted R-squared: 0.2306 ## F-statistic: 80.12 on 1 and 263 DF, p-value: < 2.2e-16 ``` ] --- class: left, top, highlight-last-item # Variables categóricas En este caso la ecuación va a ser: `$$\hat{Peso} = b_0 + b_1 AnchoPico + b_2 Especie$$` Donde especie va a valer 0 para una especie y 1 para otra. ```r m2<- penguins_adelie_gentoo%>% lm(body_mass_g ~ bill_depth_mm + species, .) m2 ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ bill_depth_mm + species, data = .) ## ## Coefficients: ## (Intercept) bill_depth_mm speciesGentoo ## -1250.0 270.1 2291.4 ``` `$$\hat{Peso} = -1250.0 + 270.1 AnchoPico + 2291.4 Especie$$` Por ejemplo: .pull-left[ Un pingüino **Adelie** de **ancho de pico 18 mm** va atener un peso estimado de: `$$\hat{Peso} = -1250.0 + 270.1 · 18 + 2291.4 · 0 = 3611.8 g$$` ] .pull-right[ Mientras que si es **Gentoo**: `$$\hat{Peso} = -1250.0 + 270.1 · 18 + 2291.4 · 1 = 5903.2 g$$` ] --- class: left, top, highlight-last-item # Variables categóricas ¿Y si tenemos más de dos niveles en la variable **species**? ```r m2<- penguins %>% lm(body_mass_g ~ bill_depth_mm + species, .) m2 ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ bill_depth_mm + species, data = .) ## ## Coefficients: ## (Intercept) bill_depth_mm speciesChinstrap speciesGentoo ## -1007.28 256.61 13.38 2238.67 ``` `$$\hat{Peso} = b_0 + b_1 AnchoPico + b_2 EspecieChinstrap + b_3 EspecieGentoo$$` `$$\hat{Peso} = -1007.28 + 256.61 AnchoPico + 13.38 EspecieChinstrap + 2238.67 EspecieGentoo$$` **Adelie** es el nivel de base - *EspecieChinstrap* vale 1 si es **Chinstrap** y 0 si no - *EspecieGentoo* vale 1 si es **Gentoo** y 0 si no Es importante notar qu, si bien agregamos un predictor, agregamos dos parámetros. `$$R_{adj}^2 = 1 - \frac{sd^2_{residuos}}{sd^2} \times \frac{(n-1)}{(n-k-1)}$$` --- class: left, top, highlight-last-item # Interacciones ¿Hasta ahora vimos modelos únicamente con contribuciones aditivas, pero que pasa si tenemos una **interacción**? Una **interacción** es un efecto que depende de más de una variable, por ejemplo: `$$\hat{Peso} = b_0 + b_1· AnchoPico + b_2 · EspecieGentoo + b_3 · AnchoPico \times EspecieGentoo$$` .pull-left[ Por ejemplo: ```r m2<- penguins_adelie_gentoo %>% lm(body_mass_g ~ bill_length_mm + species + bill_length_mm:species, .) m2 ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ bill_length_mm + species + bill_length_mm:species, ## data = .) ## ## Coefficients: ## (Intercept) bill_length_mm ## 66.45 93.75 ## speciesGentoo bill_length_mm:speciesGentoo ## -94.35 13.89 ``` ] .pull-right[ Un pingüino **Adelie** de **ancho de pico 18 mm** va a tener un peso estimado de: `$$\hat{Peso} = 66.45 + 93.75· 18 -94.35 · 0 + 13.89 · 18 · 0$$` `$$\hat{Peso} = 1753.95g$$` Mientras que si el pingüino es **Gentoo**: `$$\hat{Peso} = 66.45 + 93.75· 18 -94.35 · 1 + 13.89 · 18 · 1$$` `$$\hat{Peso} = 1909.62g$$` ] --- class: left, top, highlight-last-item # Resumen .huge[ Aprendimos a: - Ajustar un modelo con **más de un predictor** - Interpretar el efecto de **variables categóricas** - Interpretar las **interacciones** - Evaluar si un ajuste es "mejor" que otro usando el `\(R^2_{adj}\)` Ahora nos queda ver cómo podemos **inferir** el comportamiento de una población a partir de una muestra para regresiones múltiples. ] --- class: center, top # Referencias .left[.big[ - Mine Çetinkaya-Rundel and Johanna Hardin (2021). Introduction to Modern Statistics. Openintro Project. https://openintro-ims.netlify.app/index.html. ]]