Package smartsnp is not a data-conversion tool. Inspired by the command-line tool SMARTPCA, smartsnp handles SNP datasets in text format and in SMARTPCA formats (uncompressed = EIGENSTRAT or compressed binary = PACKENDANCESTRYMAP) and a general genotype matrix format. However, both VCF (.vcf) and PLINK (.bed) formats are frequently used for storing genetic variation data. In this vignette we provide a quick and robust solution for how to transform these two formats into a general genotype matrix (i.e. where homozygous genotypes are coded as 0 or 2, and heterozygotes as 1) that can be used with the smartsnp package.
The general strategy is to use the plink2 software for transforming VCF or PLINK/bed files into a general (transposed) genotype matrix. It is “transposed” because PLINK and VCF files typically have samples in rows, whereas the general input file for smartsnp has samples in columns.
We make heavily use of the plink2 software, which is a comprehensive update to Shaun Purcell’s PLINK original command-line program. Binary downloads and an installation guide is available here: https://www.cog-genomics.org/plink2
In the plink2 manual, the file format that we transform our data into is called “.traw (variant-major additive component file)”. See here for more information: https://www.cog-genomics.org/plink/1.9/formats#traw
Note that the .traw format can be directly used with the lastest development version of smartsnp, no further data transformation is necessary.
The R package sim1000G contains a small VCF, an unfiltered region from the 1000 genomes Phase III sequencing data VCF, chromosome 4, CEU samples. We will first load the package and then use this file as an example dataset.
examples_dir <- system.file("examples", package = "sim1000G") vcf_file <- file.path(examples_dir, "region.vcf.gz") file.copy(from = vcf_file, to = "./") # Copy the file to the cwd
##  FALSE
We could have skipped the intermediate step of transforming the VCF into a PLINK format. The plink2 software allows to directly transform the VCF into the .traw format.
system("plink --vcf region.vcf.gz --recode A-transpose --out region_genotypeMatrix")
The VCF file just contained data from a single group (CEU). However, just to demonstrate that this file can be used with smartsnp we’ll run a simple pca analysis. Importantly, you will have to set the missing_value parameter to “NA”.
# To use .traw files, we will need to load the latest development version of smartsnp. install.packages("devtools") devtools::install_github("ChristianHuber/smartsnp")
# Load the PLINK (.fam) file to get the number of samples numSamples = nrow(read.table("region.fam")) # There is just a single group in this data group_id <- rep(c("CEU"), length.out = numSamples) # Running smart_pca sm.pca <- smart_pca(snp_data = "region_genotype.traw", sample_group = group_id, missing_value = NA) # Here is a plot of the first two components: plot(sm.pca$pca.sample_coordinates[, c(3,4)])
Voila! Now to go back to the old working directory: