Package smartsnp is not a data-conversion tool. Inspired by the command-line tool SMARTPCA, smartsnp handles SNP datasets in text format and in SMARTPCA formats (uncompressed = EIGENSTRAT or compressed binary = PACKENDANCESTRYMAP) and a general genotype matrix format. However, both VCF (.vcf) and PLINK (.bed) formats are frequently used for storing genetic variation data. In this vignette we provide a quick and robust solution for how to transform these two formats into a general genotype matrix (i.e. where homozygous genotypes are coded as 0 or 2, and heterozygotes as 1) that can be used with the smartsnp package.

The general strategy is to use the plink2 software for transforming VCF or PLINK/bed files into a general (transposed) genotype matrix. It is “transposed” because PLINK and VCF files typically have samples in rows, whereas the general input file for smartsnp has samples in columns.

We make heavily use of the plink2 software, which is a comprehensive update to Shaun Purcell’s PLINK original command-line program. Binary downloads and an installation guide is available here: https://www.cog-genomics.org/plink2

In the plink2 manual, the file format that we transform our data into is called “.traw (variant-major additive component file)”. See here for more information: https://www.cog-genomics.org/plink/1.9/formats#traw

Note that the .traw format can be directly used with the lastest development version of smartsnp, no further data transformation is necessary.

Download a small example VCF

The R package sim1000G contains a small VCF, an unfiltered region from the 1000 genomes Phase III sequencing data VCF, chromosome 4, CEU samples. We will first load the package and then use this file as an example dataset.

library(sim1000G)

# First set the current working directory (cwd) to where you want to download the data to (note that the *Downloads* folder might not exist on your computer, e.g. if you have a Windows system):
oldwd <- getwd()
setwd("~/Downloads/")
examples_dir <- system.file("examples", package = "sim1000G")
vcf_file <- file.path(examples_dir, "region.vcf.gz")

file.copy(from = vcf_file, to = "./") # Copy the file to the cwd
## [1] FALSE

VCF to raw genotype (.traw)

We could have skipped the intermediate step of transforming the VCF into a PLINK format. The plink2 software allows to directly transform the VCF into the .traw format.

system("plink --vcf region.vcf.gz --recode A-transpose --out region_genotypeMatrix")

Running smartpca

The VCF file just contained data from a single group (CEU). However, just to demonstrate that this file can be used with smartsnp we’ll run a simple pca analysis. Importantly, you will have to set the missing_value parameter to “NA”.

# To use .traw files, we will need to load the latest development version of smartsnp.
install.packages("devtools")
devtools::install_github("ChristianHuber/smartsnp")
# Load the PLINK (.fam) file to get the number of samples
numSamples = nrow(read.table("region.fam"))

# There is just a single group in this data
group_id <- rep(c("CEU"), length.out = numSamples)

# Running smart_pca
sm.pca <- smart_pca(snp_data = "region_genotype.traw", 
                    sample_group = group_id,
                    missing_value = NA)

# Here is a plot of the first two components:
plot(sm.pca$pca.sample_coordinates[, c(3,4)])

plot of chunk data_transformation_example_plot

Voila! Now to go back to the old working directory:

setwd(oldwd)