merge-fractions.Rmd
The visProteomics functions are intended for exploring and visualizing the fractions of a fractionated sample. To use them, we need one tidy data frame that lists all proteins from all fractions in rows and has two required columns: one giving the protein ID and another giving the number of the fraction in which the protein was detected. Additional columns with further protein descriptions can be added and used in visualization and exploration.
In this vignette, we demonstrate how to create a tidy data frame of proteins step by step, with easily understandable R function calls. There are other, possibly more efficient approaches; however, we wanted the code to be easy to understand and customize even for R beginners.
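As a rough picture of the target format described above, the sketch below builds a tiny data frame by hand. The column names `Accession` and `Fraction_Number` follow the names used later in this vignette. The fraction assignments are invented purely for illustration and do not come from the demonstration data.

```r
# Toy example of the target layout: one row per protein per fraction.
# The fraction assignments below are invented for illustration only.
example_tidy <- data.frame(
  Accession       = c("P02768", "P00915", "P02768", "P31944"),
  Fraction_Number = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# A protein detected in several fractions appears in several rows
table(example_tidy$Accession)
```

The point of this layout is that a protein found in more than one fraction is simply repeated, once per fraction, rather than being spread across several columns.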
Algorithms for protein detection usually return a separate list of detected proteins for every fraction, so we have one file per fraction of the fractionated sample. We want to merge all these files into one tidy data frame. The easiest way is to put all files in one folder and to name them following a fixed pattern, so that the fraction number can be read directly from the file name. For this demonstration, we saved all files in the folder “./data-fractions/” and named them “?_Fract_proteins_visProteomics.txt”, where “?” is the fraction number. We use the following code to create a new data frame.
folder_name <- "./data-fractions/" #name of the folder with saved files
file_names <- list.files(path=folder_name, pattern = "\\.txt$") #names of all files in the folder that end in .txt (pattern is a regular expression, not a wildcard)
file_names <- paste0(folder_name, file_names)
number_of_fractions <- length(file_names) #number of fractions equals the number of files in the folder
data_total <- read.table(file_names[1], header=TRUE, stringsAsFactors = FALSE, sep='\t') #read the first file
data_total$Fraction_Number <- as.numeric(strsplit(basename(file_names[1]),"_")[[1]][1]) #add fraction number: drop the folder name and take the number before the first underscore (this will depend on the naming pattern)
for(i in seq_len(number_of_fractions)[-1]){ #loop over the remaining files (skipped if there is only one file)
  df_temp <- read.table(file_names[i], header=TRUE, stringsAsFactors = FALSE, sep='\t') #read one of the files
  df_temp$Fraction_Number <- as.numeric(strsplit(basename(file_names[i]),"_")[[1]][1]) #add fraction number
  data_total <- rbind(data_total, df_temp) #append this fraction to data_total
}
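The same merge can also be sketched with `lapply()` and `do.call(rbind, ...)`, which avoids treating the first file as a special case. This is only an alternative under stated assumptions: `read_fraction` is a hypothetical helper (not part of visProteomics), and the demo writes two tiny made-up files to a temporary folder so that the block runs on its own.

```r
# Hypothetical helper (not part of visProteomics): read one fraction file
# and attach the fraction number parsed from the start of its file name.
read_fraction <- function(path) {
  df <- read.table(path, header = TRUE, stringsAsFactors = FALSE, sep = "\t")
  # basename() drops the folder; sub() keeps the text before the first "_"
  df$Fraction_Number <- as.numeric(sub("_.*$", "", basename(path)))
  df
}

# Self-contained demo: write two tiny tab-separated files to a temporary folder
demo_dir <- file.path(tempdir(), "demo-fractions")
dir.create(demo_dir, showWarnings = FALSE)
for (k in 1:2) {
  write.table(
    data.frame(Accession = c("P02768", "P00915")),
    file = file.path(demo_dir, paste0(k, "_Fract_proteins_visProteomics.txt")),
    sep = "\t", row.names = FALSE, quote = FALSE
  )
}

demo_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
# Read every file, then stack the resulting data frames row-wise
demo_total <- do.call(rbind, lapply(demo_files, read_fraction))
```

With real data, `demo_files` would be replaced by the vector of fraction files; the parsing step inside the helper still depends on the naming pattern.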
Now we may want to tidy up our data frame: rename columns, remove or add columns, and check whether the column types are correct.
colnames(data_total) <- c('Accession', 'Description', 'Sum_Coverage','Sum_Proteins',
                          'Sum_Unique_Peptides', 'Sum_Peptides', 'Sum_PSM', 'Peptides_Mascot',
                          'Peptides_Sequest', 'PSM_Mascot','PSM_Sequest', 'Coverage_Mascot',
                          'Coverage_Sequest', 'Score_Mascot','Score_Sequest', 'AAs', 'MW_kDa', 'pI', 'Fraction_Number')
data_total <- data_total[,c('Accession', 'Description', 'Sum_Coverage', 'Sum_Proteins',
'Sum_Unique_Peptides', 'Sum_Peptides', 'AAs', 'MW_kDa', 'pI', 'Fraction_Number')]
data_total[1:5,] #view the first 5 rows
#> Accession
#> 1 P02768
#> 2 P00915
#> 3 P31944
#> 4 Q14118
#> 5 P02671
#> Description
#> 1 Serum albumin OS=Homo sapiens GN=ALB PE=1 SV=2 - [ALBU_HUMAN]
#> 2 Carbonic anhydrase 1 OS=Homo sapiens GN=CA1 PE=1 SV=2 - [CAH1_HUMAN]
#> 3 Caspase-14 OS=Homo sapiens GN=CASP14 PE=1 SV=2 - [CASPE_HUMAN]
#> 4 Dystroglycan OS=Homo sapiens GN=DAG1 PE=1 SV=2 - [DAG1_HUMAN]
#> 5 Fibrinogen alpha chain OS=Homo sapiens GN=FGA PE=1 SV=2 - [FIBA_HUMAN]
#> Sum_Coverage Sum_Proteins Sum_Unique_Peptides Sum_Peptides AAs MW_kDa pI
#> 1 22.99 % 6 16 16 609 69.3 6.28
#> 2 35.63 % 3 7 7 261 28.9 7.12
#> 3 14.46 % 1 4 4 242 27.7 5.58
#> 4 3.80 % 1 3 3 895 97.4 8.56
#> 5 2.54 % 1 2 2 866 94.9 6.01
#> Fraction_Number
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
The column ‘Sum_Coverage’ should be numeric. We also want a shorter protein description: the part of the ‘Description’ string before ‘OS=’.
data_total$Sum_Coverage <- sapply(data_total$Sum_Coverage, function(x) as.numeric(gsub("%", "", x))) #convert percentage to number
data_total$Description_Trimmed <- sapply(data_total$Description, function(x) strsplit(x, " OS=")[[1]][1]) #add a column with the shortened description
data_total[1:5,c('Accession', 'Sum_Coverage', 'Description_Trimmed')] # check changed columns
#> Accession Sum_Coverage Description_Trimmed
#> 1 P02768 22.99 Serum albumin
#> 2 P00915 35.63 Carbonic anhydrase 1
#> 3 P31944 14.46 Caspase-14
#> 4 Q14118 3.80 Dystroglycan
#> 5 P02671 2.54 Fibrinogen alpha chain
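After a conversion like this, it can be worth checking that no entries silently turned into `NA`. The stand-alone sketch below uses two coverage strings from the output above plus one deliberately malformed entry ("n/a"), which is invented for illustration.

```r
# Stand-alone illustration; "n/a" is a deliberately bad, made-up entry
coverage_raw <- c("22.99 %", "35.63 %", "n/a")

# Strip the percent sign, then convert; unparseable entries become NA
coverage_num <- suppressWarnings(as.numeric(gsub("%", "", coverage_raw)))

# Show the original strings that failed to convert, if any
coverage_raw[is.na(coverage_num)]
```

Running the same check on the raw coverage strings before overwriting the column would point to rows that need manual attention.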
We may want to remove proteins that we are not interested in. We can do this based on their ID, their description or some other column. For example, we may want to remove all proteins described as “keratin” and keep, for each fraction, only the 20 proteins with the highest ‘Sum_Peptides’.
data_total <- data_total[!grepl('keratin', tolower(data_total$Description)),] #remove proteins described as 'keratin' (regardless of uppercase characters)
data_top20 <- data.frame(matrix(ncol=ncol(data_total), nrow=0)) #create new empty data frame
colnames(data_top20) <- colnames(data_total)
fraction_names <- unique(data_total$Fraction_Number) #all fractions
for(i in seq_along(fraction_names)){
  df_temp <- data_total[data_total$Fraction_Number==fraction_names[i],] #only proteins in the ith fraction
  df_temp <- df_temp[order(df_temp$Sum_Peptides, decreasing=TRUE),] #order proteins by Sum_Peptides, highest first
  data_top20 <- rbind(data_top20, head(df_temp, 20)) #keep the first 20 proteins (or all of them if the fraction has fewer)
}
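The per-fraction selection can also be sketched with `split()` and `lapply()`. The toy data below is invented (placeholder accessions "A" to "F"), and `n_keep` is a hypothetical name standing in for the 20 used above, so that the result is small enough to inspect by eye.

```r
# Toy data with invented values: two fractions with three proteins each
toy <- data.frame(
  Accession       = c("A", "B", "C", "D", "E", "F"),
  Sum_Peptides    = c(5, 9, 1, 4, 4, 7),
  Fraction_Number = c(1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

n_keep <- 2  # would be 20 in the vignette

# Split into one data frame per fraction, sort each by Sum_Peptides, keep the top rows
top_list <- lapply(split(toy, toy$Fraction_Number), function(df) {
  df <- df[order(df$Sum_Peptides, decreasing = TRUE), ]
  head(df, n_keep)
})
toy_top <- do.call(rbind, top_list)

# Each fraction should now contribute at most n_keep rows
table(toy_top$Fraction_Number)
```

The final `table()` call doubles as a quick sanity check and can be applied to `data_top20` in the same way.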
The two resulting data frames are included in the package and can be loaded with the ‘data’ function.