Merge fractions • visProteomics

visProteomics functions are intended for exploring and visualizing the fractions of a fractionated sample. To use them, we need one tidy data frame with all proteins from all fractions listed in rows and with two required columns: one giving the protein ID and another giving the number of the fraction in which the protein was detected. Additional columns with further protein descriptions can be added and used in visualization and exploration.
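As a minimal sketch of such an input, the toy data frame below has one row per protein per fraction. The column names 'Accession' and 'Fraction_Number' follow the example built later in this vignette. The fraction assignments are invented for illustration.

```r
# Toy example of a tidy input: one row per protein detected in a fraction.
# The fraction assignments are invented for illustration only.
proteins <- data.frame(
  Accession = c("P02768", "P00915", "P02768"),
  Fraction_Number = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# A protein detected in several fractions appears once per fraction.
table(proteins$Accession)
```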

In this vignette, we demonstrate how to create a tidy data frame of proteins step by step, with easily understandable R function calls. There are other, possibly more efficient approaches, but we wanted the code to be easy to understand and customize even for R beginners.

Uploading and merging data

Algorithms for protein detection usually return a separate list of detected proteins for every fraction, so each fraction of the fractionated sample has its own file. We want to merge all these files into one tidy data frame. The easiest way is to put all files in one folder and to name them following a fixed pattern, so that the fraction number can be read from the file name. For demonstration, we saved all files in the folder “./data-fractions/” and named them “?_Fract_proteins_visProteomics.txt”, where “?” is the fraction number. We use the following code to create a new data frame.

folder_name <- "./data-fractions/" #name of the folder with saved files
file_names <- list.files(path=folder_name, pattern = "\\.txt$") #lists all files in the folder whose names end in .txt (the pattern is a regular expression)
file_names <- paste0(folder_name, file_names)
number_of_fractions <- length(file_names) #number of fractions equals the number of files in the folder
data_total <- read.table(file_names[1], header=T, stringsAsFactors = F, sep='\t') #read the first file
data_total$Fraction_number <- as.numeric(strsplit(gsub(folder_name, "", file_names[1]),"_")[[1]][1]) #add fraction number: remove the folder name and take the number before the first underscore (this will depend on the naming pattern)
for(i in seq_len(number_of_fractions)[-1]){ #remaining files (the loop is skipped if there is only one file)
  df_temp <- read.table(file_names[i], header=T, stringsAsFactors = F, sep='\t') #read one of the files
  df_temp$Fraction_number <- as.numeric(strsplit(gsub(folder_name, "", file_names[i]),"_")[[1]][1]) #add fraction number
  data_total <- rbind(data_total, df_temp) # we add this fraction to the data_total
}
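The same merge can also be written without an explicit loop. The sketch below is one possible alternative, not the approach used in this vignette: it writes two small toy files to a temporary folder, reads each with a helper, and stacks the results. The helper `read_fraction`, the folder name, and the file contents are hypothetical.

```r
# Create two small tab-separated files in a temporary folder (toy data only).
tmp_dir <- file.path(tempdir(), "data-fractions-demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (k in 1:2) {
  toy <- data.frame(Accession = c("P02768", "P00915"),
                    Sum_Peptides = c(16, 7) * k)
  write.table(toy, file.path(tmp_dir, paste0(k, "_Fract_proteins_demo.txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Hypothetical helper: read one file and attach the fraction number
# parsed from the part of the file name before the first underscore.
read_fraction <- function(path) {
  df <- read.table(path, header = TRUE, stringsAsFactors = FALSE, sep = "\t")
  df$Fraction_Number <- as.numeric(strsplit(basename(path), "_")[[1]][1])
  df
}

# full.names = TRUE returns paths that already include the folder.
demo_files <- list.files(tmp_dir, pattern = "_Fract_proteins_demo\\.txt$",
                         full.names = TRUE)
demo_total <- do.call(rbind, lapply(demo_files, read_fraction))
```

Using `basename()` and `full.names = TRUE` avoids pasting and then removing the folder name by hand.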

Tidying up the data frame

Now we may want to tidy up our data frame: rename columns, remove or add columns, and check whether each column has the correct type.

colnames(data_total) <- c('Accession', 'Description', 'Sum_Coverage','Sum_Proteins',
                          'Sum_Unique_Peptides', 'Sum_Peptides', 'Sum_PSM', 'Peptides_Mascot',
                          'Peptides_Sequest', 'PSM_Mascot','PSM_Sequest', 'Coverage_Mascot',
                          'Coverage_Sequest', 'Score_Mascot','Score_Sequest', 'AAs', 'MW_kDa', 'pI', 'Fraction_Number')
data_total <- data_total[,c('Accession', 'Description', 'Sum_Coverage', 'Sum_Proteins',
                          'Sum_Unique_Peptides', 'Sum_Peptides', 'AAs', 'MW_kDa', 'pI', 'Fraction_Number')]
data_total[1:5,] #view the first 5 rows
#>   Accession
#> 1    P02768
#> 2    P00915
#> 3    P31944
#> 4    Q14118
#> 5    P02671
#>                                                              Description
#> 1          Serum albumin OS=Homo sapiens GN=ALB PE=1 SV=2 - [ALBU_HUMAN]
#> 2   Carbonic anhydrase 1 OS=Homo sapiens GN=CA1 PE=1 SV=2 - [CAH1_HUMAN]
#> 3         Caspase-14 OS=Homo sapiens GN=CASP14 PE=1 SV=2 - [CASPE_HUMAN]
#> 4          Dystroglycan OS=Homo sapiens GN=DAG1 PE=1 SV=2 - [DAG1_HUMAN]
#> 5 Fibrinogen alpha chain OS=Homo sapiens GN=FGA PE=1 SV=2 - [FIBA_HUMAN]
#>   Sum_Coverage Sum_Proteins Sum_Unique_Peptides Sum_Peptides AAs MW_kDa   pI
#> 1      22.99 %            6                  16           16 609   69.3 6.28
#> 2      35.63 %            3                   7            7 261   28.9 7.12
#> 3      14.46 %            1                   4            4 242   27.7 5.58
#> 4       3.80 %            1                   3            3 895   97.4 8.56
#> 5       2.54 %            1                   2            2 866   94.9 6.01
#>   Fraction_Number
#> 1               1
#> 2               1
#> 3               1
#> 4               1
#> 5               1
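One quick way to check column types is to list the class of every column. The sketch below uses a toy data frame that mimics a few of the columns printed above; it is an illustration only, not part of the package.

```r
# Toy data frame mimicking three of the columns shown above.
toy <- data.frame(
  Accession = c("P02768", "P00915"),
  Sum_Coverage = c("22.99 %", "35.63 %"),
  Sum_Peptides = c(16, 7),
  stringsAsFactors = FALSE
)

# Class of every column; "character" where a number is expected signals a problem.
col_classes <- sapply(toy, class)
col_classes

# Names of the columns that are not numeric.
names(col_classes)[col_classes != "numeric"]
```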

The column ‘Sum_Coverage’ should be numeric, but it was read as text because of the percent sign. We also want a shorter protein description: the part of the ‘Description’ string before ‘OS=’.

data_total$Sum_Coverage <- sapply(data_total$Sum_Coverage, function(x) as.numeric(gsub("%", "", x))) #convert percentage to number
data_total$Description_Trimmed <- sapply(data_total$Description, function(x) strsplit(x, " OS=")[[1]][1]) #add an additional column with the shortened description
data_total[1:5,c('Accession', 'Sum_Coverage', 'Description_Trimmed')] # check changed columns
#>   Accession Sum_Coverage    Description_Trimmed
#> 1    P02768        22.99          Serum albumin
#> 2    P00915        35.63   Carbonic anhydrase 1
#> 3    P31944        14.46             Caspase-14
#> 4    Q14118         3.80           Dystroglycan
#> 5    P02671         2.54 Fibrinogen alpha chain
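Because `sub()` and `as.numeric()` work on whole vectors, the same two conversions can be written without `sapply`. This is a sketch of an alternative on two values copied from the output above, not the code used to build the packaged data.

```r
coverage <- c("22.99 %", "3.80 %")
description <- c("Serum albumin OS=Homo sapiens GN=ALB PE=1 SV=2 - [ALBU_HUMAN]",
                 "Dystroglycan OS=Homo sapiens GN=DAG1 PE=1 SV=2 - [DAG1_HUMAN]")

# Remove the percent sign; as.numeric() tolerates the remaining whitespace.
coverage_num <- as.numeric(sub("%", "", coverage, fixed = TRUE))

# Drop everything from " OS=" to the end of the string.
description_trimmed <- sub(" OS=.*$", "", description)
```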

We may want to remove proteins that are of no interest. We can do this based on their ID, their description or some other column. For example, we may want to remove all proteins described as “keratin” and keep only the 20 proteins with the highest ‘Sum_Peptides’ in each fraction.

data_total <- data_total[!grepl('keratin', tolower(data_total$Description)),] #remove proteins described as 'keratin' (regardless of uppercase characters)
data_top20 <- data.frame(matrix(ncol=ncol(data_total), nrow=0)) #create new empty data frame
colnames(data_top20) <- colnames(data_total)
fraction_names <- unique(data_total$Fraction_Number) #all fractions
for(i in seq_along(fraction_names)){
  df_temp <- data_total[data_total$Fraction_Number==fraction_names[i],] #only proteins in the ith fraction
  df_temp <- df_temp[order(df_temp$Sum_Peptides, decreasing=TRUE),] #order proteins by Sum_Peptides
  data_top20 <- rbind(data_top20, df_temp[1:min(20, nrow(df_temp)),]) #keep the first 20 proteins
}
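The per-fraction selection can also be expressed with `split()` and `lapply()`. The sketch below is an alternative shown on toy data; `top_n` is a hypothetical cutoff set to 2 so that the effect is visible on five rows, whereas the code above keeps 20.

```r
toy <- data.frame(
  Accession = c("A", "B", "C", "D", "E"),
  Sum_Peptides = c(5, 9, 2, 7, 1),
  Fraction_Number = c(1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)
top_n <- 2  # hypothetical cutoff for the toy data

# Split by fraction, sort each piece by Sum_Peptides, keep the first top_n rows.
per_fraction <- lapply(split(toy, toy$Fraction_Number), function(df) {
  head(df[order(df$Sum_Peptides, decreasing = TRUE), ], top_n)
})
toy_top <- do.call(rbind, per_fraction)
rownames(toy_top) <- NULL
```

`head()` returns fewer rows when a fraction has fewer than `top_n` proteins, so no `min()` guard is needed.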

The two resulting data frames are included in the package and can be loaded with the ‘data’ function.