Palmer Penguins Dataset

The palmerpenguins package for R was used to try out data analysis with R. The Palmer Penguins dataset is a set of observations for 3 species of penguins.

First, we load the packages needed for this project:

library("ggplot2")
library("skimr")
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("palmerpenguins")
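
If any of the library() calls above fails, the package is probably not installed yet. A minimal sketch for checking this, assuming the four package names used in this post (the variable names `required`, `is_installed` and `missing_pkgs` are illustrative only):

```r
# Packages this post loads (taken from the library() calls above)
required <- c("ggplot2", "skimr", "tidyverse", "palmerpenguins")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not available
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

if (length(missing_pkgs) > 0) {
  # Only report what is missing; installing is left to the reader,
  # for example with install.packages(missing_pkgs)
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

The sketch only reports missing packages rather than installing them, so it is safe to run in any session.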

Summary of Palmer Penguins Dataset

The summary below shows 344 observations of 3 different species of penguins: Adelie, Chinstrap and Gentoo. Each record has 8 variables:

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2
skim_without_charts(penguins)
Data summary
Name                     penguins
Number of rows           344
Number of columns        8
_______________________
Column type frequency:
  factor                 3
  numeric                5
________________________
Group variables          None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
species                0           1.00  FALSE           3  Ade: 152, Gen: 124, Chi: 68
island                 0           1.00  FALSE           3  Bis: 168, Dre: 124, Tor: 52
sex                   11           0.97  FALSE           2  mal: 168, fem: 165

Variable type: numeric

skim_variable      n_missing  complete_rate     mean      sd      p0      p25      p50     p75    p100
bill_length_mm             2           0.99    43.92    5.46    32.1    39.23    44.45    48.5    59.6
bill_depth_mm              2           0.99    17.15    1.97    13.1    15.60    17.30    18.7    21.5
flipper_length_mm          2           0.99   200.92   14.06   172.0   190.00   197.00   213.0   231.0
body_mass_g                2           0.99  4201.75  801.95  2700.0  3550.00  4050.00  4750.0  6300.0
year                       0           1.00  2008.03    0.82  2007.0  2007.00  2008.00  2009.0  2009.0
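
The headline numbers in these summaries can also be checked directly with base R. This is a small sketch that assumes only that the palmerpenguins package is loaded:

```r
library(palmerpenguins)

dim(penguins)            # number of rows, then number of columns
names(penguins)          # the variable names
table(penguins$species)  # number of observations per species
```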

Sample Penguin Dataset

Here we view the first 10 records of the dataset using the head() function. Note that some records contain missing values (NA).

head(penguins,10)
## # A tibble: 10 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 2 more variables: sex <fct>, year <int>
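
To see how many values are missing, one option is to count the NA entries per column. This is a minimal sketch using base R; `na_per_column` and `complete_rows` are names chosen here for illustration:

```r
library(palmerpenguins)

# Count the missing values in each column
na_per_column <- colSums(is.na(penguins))
na_per_column

# Number of rows with no missing values at all,
# i.e. the rows that na.omit() would keep
complete_rows <- nrow(na.omit(penguins))
complete_rows
```

The per-column counts should agree with the NA's rows in the summary() output above.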

Plots

We want to find out which of the 3 species is the largest, so we use a scatterplot that maps each record and separates the records by species. The resulting plot shows that Gentoo penguins are generally the largest.
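
The code for this scatterplot is not shown here, so the following is only a sketch of how such a plot might be built. It assumes body mass against flipper length as the measure of size (the same variables used in the island plot further down) and adds a per-species mean as a numeric cross-check. The names `complete`, `mean_mass` and `size_plot` are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Drop records with missing values so every point can be drawn
complete <- na.omit(penguins)

# Mean body mass (g) per species, as a numeric cross-check of the plot
# (assumption: body mass is what "largest" refers to)
mean_mass <- aggregate(body_mass_g ~ species, data = complete, FUN = mean)
mean_mass

# Scatterplot of every record, with colour and point shape set by species
size_plot <- ggplot(data = complete) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g,
                           color = species, shape = species)) +
  labs(title = "Palmer Penguins: Body Mass vs. Flipper Length")
size_plot
```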

We can also dig deeper into the data by looking at where the 3 species live across the 3 islands. The plot below shows that Adelie penguins were observed on all 3 islands, while Gentoo and Chinstrap penguins were observed only on Biscoe and Dream islands, respectively.

# Drop records with missing values, then plot body mass against
# flipper length with one panel per island
ggplot(data = na.omit(penguins)) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species, shape = species)) +
  labs(title = "Palmer Penguins: Body Mass vs. Flipper Length by Island",
       subtitle = "Sample of Three Penguin Species",
       caption = "Data collected by Dr. Kristen Gorman") +
  facet_wrap(~island)
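
The island pattern can also be checked without a plot by cross-tabulating species against island. This is a small sketch using base R's table(); `species_by_island` is a name chosen here for illustration:

```r
library(palmerpenguins)

# Rows are species, columns are islands;
# a zero means no observations of that species on that island
species_by_island <- table(penguins$species, penguins$island)
species_by_island
```

A species that appears on only one island will have a single non-zero entry in its row.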

This is just a simple analysis of the Palmer Penguins dataset to get familiar with R. I find it very interesting how easy R makes it to clean and visualize data all in one place. Feel free to connect with me if you have any suggestions or comments!