library(tidyverse)
descriptions_data <-
read_csv(here::here("_data", "convict_descriptions.2020-12-17.csv"), guess_max = 100000)Tidying Tattoos: Handling missing data and non-tidy columns
What is “tidy data”?
“Tidy” data describes data that conforms to three rules:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Data usually needs to be in a tidy format to work with R Tidyverse tools. But it isn’t necessarily the best format for data entry or storage, and very few real world datasets are tidy. So it’s almost always necessary to do some tidying as well as cleaning up more general messiness before being able to analyse data, and the Criminal Tattoos data is no exception. This post introduces just a couple of very common issues:
- cleaning: missing data, which can be encoded in varying ways
- tidying: multiple values in a single cell
I think this is also a good introduction to the fluidity of working with data in software like R. The same data can be easily reshaped in many ways, depending on the task in hand, and be manipulated several times in the course of the same analysis.
The data
Load the Tidyverse packages and the descriptions data file. (For more on both of these, see this introductory post.)
I don’t want all the stuff in the file, so I’ll slim it down a bit, to keep only descriptions that contain tattoos and the columns that I’ll be working with here.
tattoo_descriptions <-
descriptions_data %>%
filter(hastattoo=="y") %>%
select(subrecordid, descyear, born, gender, designs, digest)Missing data
Summarising a variable often immediately highlights problems. In this case, missing description years (as well as birth years) are encoded as 0.
tattoo_descriptions %>%
count(descyear)# A tibble: 125 × 2
descyear n
<dbl> <int>
1 0 279
2 1793 1
3 1796 1
4 1797 11
5 1798 1
6 1801 1
7 1804 1
8 1806 2
9 1808 2
10 1809 1
# … with 115 more rows
R doesn’t know that 0 means “missing”; it just treats it as a number.
tattoo_descriptions %>%
count(descyear) %>%
ggplot(aes(x=descyear, y=n)) +
geom_line()
I could just filter out 0s in a plot like this, but it’s better to convert them to “NA” from the start so R will handle them properly. There are several ways to do this.
a simple case
If you want to convert every 0 in your data to NA, na_if() is used to be an easy and convenient option. (A recent software update outlawed this use of na_if(). Apparently it was unintended and “accidental”.)
tattoo_descriptions %>%
select(descyear, born) %>%
filter(descyear==0) %>%
# convert 0s in both descyear and born to NA
mutate(descyear = na_if(descyear, 0), born=na_if(born, 0))# A tibble: 279 × 2
descyear born
<dbl> <dbl>
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
7 NA 1797
8 NA NA
9 NA NA
10 NA NA
# … with 269 more rows
# na_if(0) # this is no longer permitted
# you could do this instead
# mutate(across(everything(), ~na_if(., 0)))more complex cases
A smarter option is replace_with_na() (from the {naniar} package). With this, you can specify which columns need attention, and list several possible values that might need conversion to NA (common variants might include -99 or “unknown”).
library(naniar)tattoo_descriptions %>%
select(descyear, born, gender) %>%
filter(descyear==0) %>%
# replace only 0s in descyear with NA
replace_with_na(replace=list(descyear=0))# A tibble: 279 × 3
descyear born gender
<dbl> <dbl> <chr>
1 NA 0 m
2 NA 0 m
3 NA 0 m
4 NA 0 m
5 NA 0 m
6 NA 0 m
7 NA 1797 m
8 NA 0 m
9 NA 0 m
10 NA 0 m
# … with 269 more rows
What about gender? For some reason there are both “u” for unknown and NA in the data. (Admittedly, there are only two of them so I’d be inclined to just drop them in this case…)
tattoo_descriptions %>%
count(gender)# A tibble: 4 × 2
gender n
<chr> <int>
1 f 3635
2 m 71811
3 u 1
4 <NA> 1
I can fix several missing data issues in the same line of code.
tattoo_descriptions %>%
select(descyear, born, gender) %>%
filter(gender=="u" | is.na(gender)) %>%
# replace 0s in descyear and born and "u" in gender with NA
replace_with_na(replace=list(descyear=0, born=0, gender="u"))# A tibble: 2 × 3
descyear born gender
<dbl> <dbl> <chr>
1 1881 NA <NA>
2 1881 1837 <NA>
After the 0s have been replaced with NAs, the line chart drops the missing years automatically (I can suppress the warning message, if that’s the behaviour I want, or investigate further if it isn’t).
tattoo_descriptions %>%
replace_with_na(replace=list(descyear=0)) %>%
count(descyear) %>%
ggplot(aes(x=descyear, y=n)) +
geom_line()
Multiple values in one cell
a simple case
The designs column contains all the tattoo designs identified in a description, separated with a “pipe” | symbol. This makes sense as a concise format for storing the data, but it’s not at all tidy or easy to work with.
tattoo_descriptions %>%
filter(!is.na(designs)) %>%
select(subrecordid, designs)# A tibble: 64,098 × 2
subrecordid designs
<chr> <chr>
1 cin77589 anchor|dog|moon|man and woman|diamond|half|sun|heart
2 cin77592 anchor
3 cin77603 man and woman|crucifix|mermaid
4 cin77604 dot
5 cin77609 dot|ring
6 cin77616 sun moon and stars
7 cin77619 woman|flower pot
8 cin77622 figure|woman|shamrock|thistle
9 cin77623 crown
10 cin77626 ship|anchor|crown|thistle|ring|rose|mermaid
# … with 64,088 more rows
Again, I have more than one option for this.
One is a two-step process. First, str_split() splits up the string into pieces on each | symbol, putting them in a “list-column”.
tattoo_descriptions %>%
select(subrecordid, designs) %>%
filter(!is.na(designs)) %>%
mutate(design = str_split(designs, "\\|")) # A tibble: 64,098 × 3
subrecordid designs design
<chr> <chr> <list>
1 cin77589 anchor|dog|moon|man and woman|diamond|half|sun|heart <chr [8]>
2 cin77592 anchor <chr [1]>
3 cin77603 man and woman|crucifix|mermaid <chr [3]>
4 cin77604 dot <chr [1]>
5 cin77609 dot|ring <chr [2]>
6 cin77616 sun moon and stars <chr [1]>
7 cin77619 woman|flower pot <chr [2]>
8 cin77622 figure|woman|shamrock|thistle <chr [4]>
9 cin77623 crown <chr [1]>
10 cin77626 ship|anchor|crown|thistle|ring|rose|mermaid <chr [7]>
# … with 64,088 more rows
Then unnest() is used to unpack each element of the list onto its own line.
tattoo_descriptions_unnest <-
tattoo_descriptions %>%
filter(!is.na(designs)) %>%
mutate(design = str_split(designs, "\\|")) %>%
unnest(design)
tattoo_descriptions_unnest %>%
select(subrecordid, design) # A tibble: 176,777 × 2
subrecordid design
<chr> <chr>
1 cin77589 anchor
2 cin77589 dog
3 cin77589 moon
4 cin77589 man and woman
5 cin77589 diamond
6 cin77589 half
7 cin77589 sun
8 cin77589 heart
9 cin77592 anchor
10 cin77603 man and woman
# … with 176,767 more rows
But in fact I find it easier to use unnest_tokens() ({tidytext} package) to do this kind of task. It has more options for more complex cases, can be done in a single step, and is quicker.
library(tidytext)
tattoo_descriptions_unnest_tokens <-
tattoo_descriptions %>%
filter(!is.na(designs)) %>%
unnest_tokens(design, designs, token="regex", pattern="\\|", to_lower = FALSE)
tattoo_descriptions_unnest_tokens %>%
select(subrecordid, design) # A tibble: 176,777 × 2
subrecordid design
<chr> <chr>
1 cin77589 anchor
2 cin77589 dog
3 cin77589 moon
4 cin77589 man and woman
5 cin77589 diamond
6 cin77589 half
7 cin77589 sun
8 cin77589 heart
9 cin77592 anchor
10 cin77603 man and woman
# … with 176,767 more rows
more complex cases
The digest column is a considerably more complicated case. It contains what we called “body chunks” for each description - each body location identified as having at least one tattoo, followed by the list of tattoo terms. The format is [body part : list|of|terms], with chunks separated by a space.
tattoo_descriptions %>%
select(digest) # A tibble: 75,448 × 1
digest
<chr>
1 [right arm : man and woman|sun|dog|ST|1838] [left arm : half|moon|sun|ancho…
2 [right arm : CPEP] [right hand : JP]
3 [right arm : anchor|LARK] [ : EM]
4 [left arm : TH]
5 [right|shoulder : crucifix] [right arm : man and woman] [left arm : mermai…
6 [right hanging body specifier|hand : dot|dot]
7 [left hand : dot|ring]
8 [left arm : MB]
9 [left arm : sun moon and stars]
10 [left hand : flower pot] [right following body specifier : woman]
# … with 75,438 more rows
Three distinct steps are needed:
- put each chunk on its own line
- separate body part and tattoos
- put each tattoo on its own line
For the first step, I’ll use unnest_tokens again, followed by str_remove_all() to get rid of the square brackets.
(This needs a more advanced regex than the previous example; if you want to know more about how the regex works, try pasting the expression (?=\[[^\]]+\]) * into here.)
tattoo_chunks_unnest_tokens <-
tattoo_descriptions %>%
select(subrecordid, digest) %>%
unnest_tokens(chunk, digest, token="regex", pattern="(?=\\[[^\\]]+\\]) *", to_lower = FALSE) %>%
mutate(chunk = str_remove_all(chunk, "\\[|\\]"))
tattoo_chunks_unnest_tokens# A tibble: 143,609 × 2
subrecordid chunk
<chr> <chr>
1 cin77589 "right arm : man and woman|sun|dog|ST|1838 "
2 cin77589 "left arm : half|moon|sun|anchor|heart|diamond|IMB"
3 cin77590 "right arm : CPEP "
4 cin77590 "right hand : JP"
5 cin77592 "right arm : anchor|LARK "
6 cin77592 " : EM"
7 cin77598 "left arm : TH"
8 cin77603 "right|shoulder : crucifix "
9 cin77603 "right arm : man and woman "
10 cin77603 "left arm : mermaid"
# … with 143,599 more rows
Step two, splitting body part and tattoo terms into separate columns, uses the separate() function.
tattoo_chunks_unnest_tokens_separated <-
tattoo_chunks_unnest_tokens %>%
separate(chunk, into=c("body", "tattoos"), sep=" *: *")
tattoo_chunks_unnest_tokens_separated# A tibble: 143,609 × 3
subrecordid body tattoos
<chr> <chr> <chr>
1 cin77589 "right arm" "man and woman|sun|dog|ST|1838 "
2 cin77589 "left arm" "half|moon|sun|anchor|heart|diamond|IMB"
3 cin77590 "right arm" "CPEP "
4 cin77590 "right hand" "JP"
5 cin77592 "right arm" "anchor|LARK "
6 cin77592 "" "EM"
7 cin77598 "left arm" "TH"
8 cin77603 "right|shoulder" "crucifix "
9 cin77603 "right arm" "man and woman "
10 cin77603 "left arm" "mermaid"
# … with 143,599 more rows
Finally, to put each tattoo on its own line, I repeat the earlier step with unnest_tokens. The new output is a bit different from the first case because the digest includes everything identified as a tattoo, not just pictorial designs.
tattoo_chunks_tattoos_unnest_tokens <-
tattoo_chunks_unnest_tokens_separated %>%
unnest_tokens(tattoo, tattoos, token="regex", pattern="\\|", to_lower = FALSE)
tattoo_chunks_tattoos_unnest_tokens# A tibble: 343,417 × 3
subrecordid body tattoo
<chr> <chr> <chr>
1 cin77589 right arm "man and woman"
2 cin77589 right arm "sun"
3 cin77589 right arm "dog"
4 cin77589 right arm "ST"
5 cin77589 right arm "1838 "
6 cin77589 left arm "half"
7 cin77589 left arm "moon"
8 cin77589 left arm "sun"
9 cin77589 left arm "anchor"
10 cin77589 left arm "heart"
# … with 343,407 more rows
Now what?
Now I can have a lot more fun!
tattoo_descriptions_unnest_tokens %>%
count(design, sort=TRUE) %>%
head(10) %>%
ggplot(aes(x=fct_reorder(design, n), y=n)) +
geom_col() +
coord_flip() +
labs(y="count", x=NULL, title="Top ten tattoo designs")
Which are the most popular body locations for tattoos? This needs more cleaning, but you get the gist.
tattoo_chunks_unnest_tokens_separated %>%
mutate(body = str_replace_all(body, "\\|", " ")) %>%
count(body, sort = TRUE) %>%
head(10) %>%
ggplot(aes(x=fct_reorder(body, n), y=n)) +
geom_col() +
coord_flip() +
labs(y="count", x=NULL, title="Top ten tattoo locations")
And here’s one I cleaned up earlier…
