TP4 - Exploratory data analysis

In this TP we will use the exploratory data analysis tools covered in the course to analyze a dataset we have never seen before.

The data

This dataset contains all Netflix original films released up to June 1, 2021. It also includes all Netflix documentaries and specials. The data were downloaded from this Kaggle page. IMDB scores are voted on by community members, and most of the films have more than 1,000 reviews.

The columns of the dataset are:

Description	Column
Film title	Title
Film genre	Genre
Original premiere date	Premiere
Runtime in minutes	Runtime
IMDB score (as of June 1, 2021)	IMDB Score
Available languages (as of June 1, 2021)	Language

For more details about the data, see here.

The data are in the Practicos/tp4-EDA/data folder of the repository. We can download them and load them with read_csv(), or load them directly from the URL.

library(tidyverse)
library(here)

netflix <- read_csv(here("docs/Practicos/tp4-EDA/data/NetflixOriginals.csv"))

1 - Look at the data

Let's use the functions summary(), str() and glimpse() to see the structure of our dataset and the types of its variables.

Does it have any NA values?
Is any variable stored with the wrong type?

Hint: One of the simplest ways to convert a date from chr to Date is the {lubridate} package. For example, mdy("August 5, 2019") returns the date "2019-08-05". The advantage of Date variables is that we can sort them, do arithmetic with them, and so on.
Are there any suspicious values?

Once the problems in the dataset are fixed, print a summary using the skim() function from the {skimr} package.
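As a minimal sketch of the two checks suggested above (counting NA values per column and converting a chr date with mdy()), here is a version on a small made-up tibble. The `toy` data and its values are assumptions for illustration only; they are not rows from the real file.

```r
library(dplyr)
library(lubridate)

# Toy data in the same shape as the real file (values are made up)
toy <- tibble(
  Title = c("A", "B", "C"),
  Premiere = c("August 5, 2019", "January 19, 2018", "December 26, 2019"),
  `IMDB Score` = c(6.1, NA, 7.4)
)

# Count missing values in every column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Convert the character dates to Date, then sort chronologically
toy_dates <- toy %>%
  mutate(Premiere = mdy(Premiere)) %>%
  arrange(Premiere)

# Dates support arithmetic: days between the first and the last premiere
span_days <- as.numeric(max(toy_dates$Premiere) - min(toy_dates$Premiere))

na_counts
toy_dates
span_days
```

Sorting the original chr column would order the rows alphabetically by month name, which is why the conversion has to come first.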

# Solution
str(netflix)
#> spec_tbl_df [585 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ Title     : chr [1:585] "Enter the Anime" "Dark Forces" "The App" "The Open House" ...
#>  $ Genre     : chr [1:585] "Documentary" "Thriller" "Science fiction/Drama" "Horror thriller" ...
#>  $ Premiere  : chr [1:585] "August 5, 2019" "August 21, 2020" "December 26, 2019" "January 19, 2018" ...
#>  $ Runtime   : num [1:585] 58 81 79 94 90 147 112 149 73 139 ...
#>  $ IMDB Score: num [1:585] 2.5 2.6 2.6 3.2 3.4 3.5 3.7 3.7 3.9 4.1 ...
#>  $ Language  : chr [1:585] "English/Japanese" "Spanish" "Italian" "English" ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   Title = col_character(),
#>   ..   Genre = col_character(),
#>   ..   Premiere = col_character(),
#>   ..   Runtime = col_double(),
#>   ..   `IMDB Score` = col_double(),
#>   ..   Language = col_character()
#>   .. )
#>  - attr(*, "problems")=<externalptr>
summary(netflix)
#>     Title              Genre             Premiere            Runtime      
#>  Length:585         Length:585         Length:585         Min.   :  4.00  
#>  Class :character   Class :character   Class :character   1st Qu.: 86.00  
#>  Mode  :character   Mode  :character   Mode  :character   Median : 97.00  
#>                                                           Mean   : 93.57  
#>                                                           3rd Qu.:108.00  
#>                                                           Max.   :209.00  
#>                                                                           
#>    IMDB Score      Language        
#>  Min.   :2.500   Length:585        
#>  1st Qu.:5.700   Class :character  
#>  Median :6.350   Mode  :character  
#>  Mean   :6.272                     
#>  3rd Qu.:7.000                     
#>  Max.   :9.000                     
#>  NA's   :1
# There is one NA
# The minimum runtime looks odd (4 minutes)
# The ratings look fine
glimpse(netflix)
#> Rows: 585
#> Columns: 6
#> $ Title        <chr> "Enter the Anime", "Dark Forces", "The App", "The Open Ho…
#> $ Genre        <chr> "Documentary", "Thriller", "Science fiction/Drama", "Horr…
#> $ Premiere     <chr> "August 5, 2019", "August 21, 2020", "December 26, 2019",…
#> $ Runtime      <dbl> 58, 81, 79, 94, 90, 147, 112, 149, 73, 139, 58, 112, 97, …
#> $ `IMDB Score` <dbl> 2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1, 4.1, 4.…
#> $ Language     <chr> "English/Japanese", "Spanish", "Italian", "English", "Hin…
# Premiere is a date variable but shows up as <chr>

# Let's fix the date
library(lubridate)
netflix <- netflix %>% 
  mutate(Premiere = mdy(Premiere))

# Drop the row with the NA
netflix <- netflix %>% drop_na()

glimpse(netflix) # Now the types are right
#> Rows: 584
#> Columns: 6
#> $ Title        <chr> "Enter the Anime", "Dark Forces", "The App", "The Open Ho…
#> $ Genre        <chr> "Documentary", "Thriller", "Science fiction/Drama", "Horr…
#> $ Premiere     <date> 2019-08-05, 2020-08-21, 2019-12-26, 2018-01-19, 2020-10-…
#> $ Runtime      <dbl> 58, 81, 79, 94, 90, 147, 112, 149, 73, 139, 58, 112, 97, …
#> $ `IMDB Score` <dbl> 2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1, 4.1, 4.…
#> $ Language     <chr> "English/Japanese", "Spanish", "Italian", "English", "Hin…

library(skimr)
skim(netflix)

Data summary
Name	netflix
Number of rows	584
Number of columns	6
_______________________
Column type frequency:
character	3
Date	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Title	1	2	105	584
Genre	1	3	36	115
Language	1	4	26	38

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
Premiere	0	1	2014-12-13	2021-05-27	2019-10-17	387

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Runtime	0	1	93.58	27.76	4.0	86.0	97.00	108	209	▁▂▇▁▁
IMDB Score	0	1	6.27	0.98	2.5	5.7	6.35	7	9	▁▂▇▇▁

# We can see that the runtime value that looked odd may not be so odd after all

Which three genres have the most releases?

netflix %>%
    count(Genre) %>%
    arrange(desc(n)) %>%
    slice_head(n = 3)
#> # A tibble: 3 × 2
#>   Genre           n
#>   <chr>       <int>
#> 1 Documentary   159
#> 2 Drama          77
#> 3 Comedy         49

And which three languages have the most releases?

netflix %>%
    count(Language) %>%
    arrange(desc(n)) %>%
    slice_head(n = 3)
#> # A tibble: 3 × 2
#>   Language     n
#>   <chr>    <int>
#> 1 English    401
#> 2 Hindi       33
#> 3 Spanish     31
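The count() + arrange() + slice_head() pattern used in the two answers above can be written more compactly, and it silently breaks ties at the cut-off. A small sketch on made-up genre labels (the `toy` tibble is an assumption, not the real data) shows both points:

```r
library(dplyr)

# Made-up genre labels, only to illustrate the counting pattern
toy <- tibble(Genre = c("Drama", "Comedy", "Drama", "Documentary",
                        "Documentary", "Documentary", "Thriller", "Comedy"))

# count(sort = TRUE) replaces count() followed by arrange(desc(n))
top_sorted <- toy %>%
  count(Genre, sort = TRUE) %>%
  slice_head(n = 2)

# slice_max() keeps ties by default, so it can return more than n rows
top_with_ties <- toy %>%
  count(Genre) %>%
  slice_max(n, n = 2)

top_sorted
top_with_ties
```

Here Comedy and Drama are tied for second place, so slice_head() keeps only one of them while slice_max() keeps both.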

2 - Variation

Let's explore the number of films in each genre with a simple bar chart.

How many genres are there?
How are the films distributed across genres?

# Solution
netflix %>%
  count(Genre, name = "Frecuencia") %>%
  filter(Frecuencia > 5) %>%
  ggplot(aes(x = reorder(Genre, Frecuencia), y = Frecuencia)) +
  geom_col() +
  coord_flip()
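The bar chart only shows genres with more than 5 films, so it does not answer "how many genres are there?" on its own. One possible way to get that number, and the share of films in each genre, is sketched below on a made-up `toy` tibble (an assumption for illustration):

```r
library(dplyr)

# Made-up genre labels; compound labels count as their own genre
toy <- tibble(Genre = c("Drama", "Comedy", "Drama",
                        "Documentary", "Comedy / Musical"))

# Number of distinct genre labels
n_genres <- toy %>%
  summarise(n_genres = n_distinct(Genre)) %>%
  pull(n_genres)

# Share of films in each genre, largest first
genre_share <- toy %>%
  count(Genre, name = "Frecuencia", sort = TRUE) %>%
  mutate(share = Frecuencia / sum(Frecuencia))

n_genres
genre_share
```

Applied to netflix, n_distinct(Genre) should agree with the n_unique value that skim() reports for Genre.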

Now let's look at the distributions of runtime and IMDB rating. What can we say about them?

# Solution
netflix %>% 
  ggplot(aes(x = `IMDB Score`)) +
  geom_histogram(fill = "#1380A1") +
  labs(title = "IMDB Score",
       x = NULL,
       y = NULL) +
  theme_minimal()

# The distribution is roughly normal, but with a longer tail toward low scores

netflix %>% 
  ggplot(aes(x = Runtime)) +
  geom_histogram(fill = "#1380A1") +
  labs(title = "Runtime",
       x = NULL,
       y = NULL) +
  theme_minimal()

# There seems to be a runtime outlier
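When geom_histogram() is called without bins or binwidth, it falls back to 30 bins and prints a message asking for a better value. A sketch with an explicit bin width follows; the simulated `toy` runtimes and the 10-minute width are assumptions chosen for illustration, not a recommendation for the real data.

```r
library(ggplot2)

# Simulated runtimes in minutes (made up, not the Netflix data)
set.seed(1)
toy <- data.frame(Runtime = round(rnorm(200, mean = 95, sd = 25)))

# binwidth fixes the width of each bar; boundary aligns bin edges to multiples of 10
p_runtime <- ggplot(toy, aes(x = Runtime)) +
  geom_histogram(binwidth = 10, boundary = 0, fill = "#1380A1") +
  labs(title = "Runtime", x = "Minutes", y = "Count") +
  theme_minimal()

p_runtime
```

It is worth trying a few widths, since a histogram can look quite different depending on how the bins are cut.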

Let's see how the IMDB ratings are distributed for the Drama and Comedy genres.

# Solution
generos <- c("Comedy", "Drama")

netflix %>% 
  filter(Genre %in% generos) %>%
  ggplot(aes(x = `IMDB Score`,
             color = Genre)) +
  geom_freqpoly() +
  labs(x = NULL,
       y = NULL) +
  theme_minimal() + 
  theme(legend.position = "top")

Finally: what about the runtime distributions for Comedy and Documentary?

# Solution
generos <- c("Comedy", "Documentary")

netflix %>% 
  filter(Genre %in% generos) %>%
  ggplot(aes(x = Runtime,
             color = Genre)) +
  geom_freqpoly() +
  labs(x = NULL,
       y = NULL) +
  theme_minimal() + 
  theme(legend.position = "top")

# There are a lot of short documentaries
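Because the two genres have very different numbers of films, raw counts make it hard to compare the shapes of the two curves. A possible alternative is to plot densities, so that the area under each curve is 1. The simulated `toy` data below (40 comedies, 160 documentaries) is an assumption for illustration.

```r
library(ggplot2)

# Simulated runtimes for two groups of very different size (made up)
set.seed(42)
toy <- data.frame(
  Genre = rep(c("Comedy", "Documentary"), times = c(40, 160)),
  Runtime = c(rnorm(40, mean = 95, sd = 10), rnorm(160, mean = 80, sd = 25))
)

# after_stat(density) rescales each group so that its curve integrates to 1
p_density <- ggplot(toy, aes(x = Runtime,
                             y = after_stat(density),
                             color = Genre)) +
  geom_freqpoly(binwidth = 10) +
  labs(x = NULL, y = NULL) +
  theme_minimal() +
  theme(legend.position = "top")

p_density
```

With counts on the y axis the larger group dominates the plot; with densities the difference in spread between the groups is easier to see.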

3 - Covariation

Using a boxplot, let's see whether there is any relationship between genre (for Comedy, Drama and Documentary films) and IMDB rating.

# Solution
generos <- c("Comedy", "Drama", "Documentary")

netflix %>% 
  filter(Genre %in% generos) %>%
  ggplot(aes(x = Genre,
             y = `IMDB Score`,
             color = Genre,
             fill = Genre)) +
  geom_boxplot(alpha = .5) +
  theme_minimal()
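The boxplot can be complemented with the numbers it draws: the median and the interquartile range per genre. This is a minimal sketch on made-up scores (the `toy` tibble and the column names `n_films`, `score_median` and `score_iqr` are assumptions):

```r
library(dplyr)

# Made-up scores for two genres
toy <- tibble(
  Genre = c("Comedy", "Comedy", "Comedy", "Drama", "Drama", "Drama"),
  `IMDB Score` = c(5.0, 5.5, 6.0, 6.0, 6.5, 7.5)
)

# One row per genre: group size, median (the box's middle line) and IQR (the box's height)
score_summary <- toy %>%
  group_by(Genre) %>%
  summarise(
    n_films = n(),
    score_median = median(`IMDB Score`),
    score_iqr = IQR(`IMDB Score`),
    .groups = "drop"
  )

score_summary
```

Reporting the group size next to the summary helps judge how much weight each box deserves.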

Next, using geom_tile(), let's look at the number of observations for each combination of the three genres and the three languages with the most releases.

# Solution
generos <- c("Comedy", "Drama", "Documentary")
idiomas <- c("English", "Hindi", "Spanish")

netflix %>%
  filter(Genre %in% generos) %>%
  filter(Language %in% idiomas) %>%
  count(Genre, Language, name = "Frecuencia") %>%
  ggplot(aes(x = Genre, y = Language, fill = Frecuencia)) +
  geom_tile() +
  geom_text(aes(label = Frecuencia), size = 3, color = "white") +
  scale_fill_viridis_c() +
  theme_minimal()
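One detail of this approach: count() only returns combinations that occur at least once, so a genre and language pair with no films shows up as a hole in the tile plot instead of a 0. A possible fix, sketched on a made-up `toy` tibble (an assumption for illustration), is to fill the missing combinations with tidyr::complete():

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up data in which some genre/language pairs never occur
toy <- tibble(
  Genre = c("Comedy", "Comedy", "Drama", "Documentary"),
  Language = c("English", "Spanish", "English", "English")
)

# complete() adds the absent combinations and fills their count with 0
counts_full <- toy %>%
  count(Genre, Language, name = "Frecuencia") %>%
  complete(Genre, Language, fill = list(Frecuencia = 0L))

p_tiles <- ggplot(counts_full, aes(x = Genre, y = Language, fill = Frecuencia)) +
  geom_tile() +
  geom_text(aes(label = Frecuencia), size = 3, color = "white") +
  scale_fill_viridis_c() +
  theme_minimal()

counts_full
p_tiles
```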

Now let's look at the covariation between two continuous variables. Let's see whether there is any relationship between the premiere date and the IMDB rating.

# Solution
netflix %>% 
  ggplot(aes(x = Premiere,
             y = `IMDB Score`)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_minimal()
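The line drawn by geom_smooth(method = lm) can also be put into numbers. The sketch below fits the same kind of linear model on made-up data (the `toy` tibble is an assumption) and reports the slope per year and the Pearson correlation. Dates are stored as days since 1970-01-01, so the raw slope is in score points per day.

```r
library(dplyr)

# Made-up premiere dates and scores
toy <- tibble(
  Premiere = as.Date(c("2016-01-15", "2017-06-01", "2018-03-20",
                       "2019-09-10", "2020-11-05")),
  `IMDB Score` = c(6.8, 6.5, 6.4, 6.0, 6.1)
)

# Linear model of score against the date expressed in days
fit <- lm(`IMDB Score` ~ as.numeric(Premiere), data = toy)

# Convert the slope from "per day" to "per year" to make it readable
slope_per_year <- unname(coef(fit)[2]) * 365.25

# Pearson correlation between date and score
r <- cor(as.numeric(toy$Premiere), toy$`IMDB Score`)

slope_per_year
r
```

A slope close to zero relative to the spread of the scores would suggest that the premiere date explains little of the rating.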

What if we keep only the three most popular genres and look at the relationship by genre?

# Solution
netflix %>% 
  filter(Genre %in% c("Comedy", "Drama", "Documentary")) %>%
  ggplot(aes(x = Premiere,
             y = `IMDB Score`,
             color = Genre)) +
  geom_point() +
  geom_smooth(method = lm,
              se = FALSE) +
  theme_minimal() +
  theme(legend.position = "top")

Finally, let's use the ggpairs() function from {GGally} to see the distributions and correlations of the numeric variables in netflix, together with the premiere date.

# Solution
library(GGally)

netflix %>% 
  select(all_of(c("Premiere", "Runtime", "IMDB Score"))) %>% 
  ggpairs() +
  theme_minimal()

4 - Outliers

Let's use the {Routliers} package to check whether we have univariate outliers in the variables Runtime (duration) and IMDB Score (IMDB rating).

# Solution
library(Routliers)

outliers_runtime <- outliers_mad(x = netflix$Runtime)
outliers_runtime
#> Call:
#> outliers_mad.default(x = netflix$Runtime)
#> 
#> Median:
#> [1] 97
#> 
#> MAD:
#> [1] 16.3086
#> 
#> Limits of acceptable range of values:
#> [1]  48.0742 145.9258
#> 
#> Number of detected outliers
#>  extremely low extremely high          total 
#>             56             10             66

plot_outliers_mad(outliers_runtime, 
                  x = netflix$Runtime)


netflix %>%
  filter(Runtime < outliers_runtime$limits[1]) %>%
  count(Genre) %>% 
  arrange(desc(n))
#> # A tibble: 14 × 2
#>    Genre                     n
#>    <chr>                 <int>
#>  1 Documentary              32
#>  2 Aftershow / Interview     6
#>  3 Animation / Short         4
#>  4 Animation                 3
#>  5 Comedy                    2
#>  6 Animation / Comedy        1
#>  7 Animation / Musicial      1
#>  8 Anime / Short             1
#>  9 Comedy / Musical          1
#> 10 Drama / Short             1
#> 11 Making-of                 1
#> 12 Mockumentary              1
#> 13 Musical / Short           1
#> 14 Stop Motion               1

netflix %>%
  filter(Runtime > outliers_runtime$limits[2])
#> # A tibble: 10 × 6
#>    Title                          Genre Premiere   Runtime `IMDB Score` Language
#>    <chr>                          <chr> <date>       <dbl>        <dbl> <chr>   
#>  1 Drive                          Acti… 2019-11-01     147          3.5 Hindi   
#>  2 The Last Days of American Cri… Heis… 2020-06-05     149          3.7 English 
#>  3 Army of the Dead               Zomb… 2021-05-21     148          5.9 English 
#>  4 Citation                       Drama 2020-11-06     151          6.2 English 
#>  5 The Forest of Love             Drama 2019-10-11     151          6.3 Japanese
#>  6 Da 5 Bloods                    War … 2020-06-12     155          6.5 English 
#>  7 Raat Akeli Hai                 Thri… 2020-07-31     149          7.3 Hindi   
#>  8 Ludo                           Anth… 2020-11-12     149          7.6 Hindi   
#>  9 The Irishman                   Crim… 2019-11-27     209          7.8 English 
#> 10 Springsteen on Broadway        One-… 2018-12-16     153          8.5 English

# Solution
outliers_IMDB <- outliers_mad(x = netflix$`IMDB Score`)
outliers_IMDB
#> Call:
#> outliers_mad.default(x = netflix$`IMDB Score`)
#> 
#> Median:
#> [1] 6.35
#> 
#> MAD:
#> [1] 0.96369
#> 
#> Limits of acceptable range of values:
#> [1] 3.45893 9.24107
#> 
#> Number of detected outliers
#>  extremely low extremely high          total 
#>              5              0              5

plot_outliers_mad(outliers_IMDB, 
                  x = netflix$`IMDB Score`)


netflix %>%
  filter(`IMDB Score` < outliers_IMDB$limits[1])
#> # A tibble: 5 × 6
#>   Title           Genre                 Premiere   Runtime `IMDB Score` Language
#>   <chr>           <chr>                 <date>       <dbl>        <dbl> <chr>   
#> 1 Enter the Anime Documentary           2019-08-05      58          2.5 English…
#> 2 Dark Forces     Thriller              2020-08-21      81          2.6 Spanish 
#> 3 The App         Science fiction/Drama 2019-12-26      79          2.6 Italian 
#> 4 The Open House  Horror thriller       2018-01-19      94          3.2 English 
#> 5 Kaali Khuhi     Mystery               2020-10-30      90          3.4 Hindi

What can we say about the Runtime outliers? Can we categorize them in some way? Can they tell us anything about Netflix productions?

And what about the IMDB rating outliers?

Finally, let's check whether there are multivariate outliers across both variables.

# Solution
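To interpret these outliers it helps to know where the limits come from. The limits printed by outliers_mad() are consistent with median ± 3 × MAD: for Runtime, 97 − 3 × 16.3086 = 48.0742 and 97 + 3 × 16.3086 = 145.9258. The sketch below reproduces that rule in base R on a made-up vector. The threshold of 3 and the use of base mad() (which scales by 1.4826 by default) are assumptions inferred from the printed output, not taken from the package source.

```r
# Made-up runtimes in minutes, with one very short and one very long value
x <- c(58, 81, 79, 94, 90, 147, 112, 149, 73, 4, 97, 209)

med <- median(x)
mad_x <- mad(x)  # median absolute deviation, scaled by 1.4826 by default

# Acceptable range: median plus or minus 3 MADs
limits <- c(med - 3 * mad_x, med + 3 * mad_x)

# Values outside the acceptable range
outliers <- x[x < limits[1] | x > limits[2]]

limits
outliers

# Check against the values printed above for Runtime
runtime_limits <- c(97 - 3 * 16.3086, 97 + 3 * 16.3086)
runtime_limits
```

Because the median and the MAD are barely affected by extreme values, the limits do not get stretched by the outliers themselves, unlike a rule based on the mean and the standard deviation.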
outliers_multi <- outliers_mcd(x = cbind(netflix$Runtime,
                                         netflix$`IMDB Score`))
outliers_multi
#> Call:
#> outliers_mcd.default(x = cbind(netflix$Runtime, netflix$`IMDB Score`))
#> 
#> Limit distance of acceptable values from the centroid :
#> [1] 9.21034
#> 
#> Number of detected outliers:
#> total 
#>    95

plot_outliers_mcd(outliers_multi,
                  x = cbind(netflix$Runtime,
                            netflix$`IMDB Score`))
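The limit distance printed above, 9.21034, matches the 0.99 quantile of a chi-squared distribution with 2 degrees of freedom, which is the usual cut-off for squared Mahalanobis distances in two dimensions. The sketch below illustrates the general idea using MASS::cov.mcd() for the robust centre and covariance. It is an approximation of the technique under stated assumptions (simulated data, default cov.mcd() settings, 0.99 cut-off), not a reproduction of what outliers_mcd() does internally.

```r
# Simulated "clean" points plus three planted outliers (all values made up)
set.seed(123)
n <- 200
clean <- cbind(Runtime = rnorm(n, mean = 95, sd = 15),
               Score = rnorm(n, mean = 6.3, sd = 0.9))
planted <- rbind(c(250, 2.0), c(5, 9.5), c(300, 9.0))
x <- rbind(clean, planted)

# Robust centre and covariance from the Minimum Covariance Determinant estimator
mcd <- MASS::cov.mcd(x)

# Squared Mahalanobis distance of every point to the robust centre
d2 <- mahalanobis(x, center = mcd$center, cov = mcd$cov)

# Cut-off: 0.99 quantile of a chi-squared distribution with one df per variable
limit <- qchisq(0.99, df = ncol(x))

# Row indices of the points beyond the cut-off
flagged <- which(d2 > limit)

limit
flagged
```

A point can be flagged here without being extreme in either variable alone, for example a very short film with an unusually high rating, which is what distinguishes multivariate from univariate outliers.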