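This post is largely a repost of material by 小洁 of 生信技能树. I don't know this area well myself and am really just copying by example, so I have made almost no changes.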
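The Tidyverse is a small ecosystem that makes R code more elegant. The pieces covered here are stringr, dplyr, and tidyr: stringr handles strings, dplyr manipulates data frames, and tidyr reshapes data frames. Tidyverse code follows English very closely, so you can tell what it does at a glance. Top marks from me!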
    library(stringr)

    # Build the example data
    title <- c("A375 cells 24h Control rep1",
               "A375 cells 24h Control rep2",
               "A375 cells 24h Control rep3",
               "A375 cells 24h Vemurafenib rep1",
               "A375 cells 24h Vemurafenib rep2",
               "A375 cells 24h Vemurafenib rep3")
    title
    [1] "A375 cells 24h Control rep1"     "A375 cells 24h Control rep2"
    [3] "A375 cells 24h Control rep3"     "A375 cells 24h Vemurafenib rep1"
    [5] "A375 cells 24h Vemurafenib rep2" "A375 cells 24h Vemurafenib rep3"

    # Split each element on spaces and simplify the result to a matrix
    str_split(title, " ", simplify = TRUE)
         [,1]   [,2]    [,3]  [,4]          [,5]
    [1,] "A375" "cells" "24h" "Control"     "rep1"
    [2,] "A375" "cells" "24h" "Control"     "rep2"
    [3,] "A375" "cells" "24h" "Control"     "rep3"
    [4,] "A375" "cells" "24h" "Vemurafenib" "rep1"
    [5,] "A375" "cells" "24h" "Vemurafenib" "rep2"
    [6,] "A375" "cells" "24h" "Vemurafenib" "rep3"

    # Extract only the 4th piece of each element (requires stringr >= 1.5.0)
    str_split_i(title, " ", 4)
    [1] "Control"     "Control"     "Control"     "Vemurafenib" "Vemurafenib"
    [6] "Vemurafenib"

    # Split each element on spaces (returns a list)
    title_words <- str_split(title, " ")

    # Take the first element of title_words as the example data for the next steps
    title2 <- title_words[[1]]
    title2
    [1] "A375"    "cells"   "24h"     "Control" "rep1"

    # Check whether each element contains "h"
    str_detect(title2, "h")
    [1] FALSE FALSE  TRUE FALSE FALSE

    # Replace the first "o" in each word with "A"
    str_replace(title2, "o", "A")
    [1] "A375"    "cells"   "24h"     "CAntrol" "rep1"

    # Remove the first "o" in each word
    str_remove(title2, "o")
    [1] "A375"   "cells"  "24h"    "Cntrol" "rep1"

    # Remove every "o" in each word
    str_remove_all(title2, "o")
    [1] "A375"  "cells" "24h"   "Cntrl" "rep1"

    library(dplyr)

    test <- iris[c(1:2, 51:52, 101:102), ]

    # Add a new column computed from existing columns
    mutate(test, new = Sepal.Length * Sepal.Width)
        Sepal.Length Sepal.Width Petal.Length Petal.Width    Species   new
    1            5.1         3.5          1.4         0.2     setosa 17.85
    2            4.9         3.0          1.4         0.2     setosa 14.70
    51           7.0         3.2          4.7         1.4 versicolor 22.40
    52           6.4         3.2          4.5         1.5 versicolor 20.48
    101          6.3         3.3          6.0         2.5  virginica 20.79
    102          5.8         2.7          5.1         1.9  virginica 15.66

    # Sort by Sepal.Length in ascending order
    arrange(test, Sepal.Length)
      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1          4.9         3.0          1.4         0.2     setosa
    2          5.1         3.5          1.4         0.2     setosa
    3          5.8         2.7          5.1         1.9  virginica
    4          6.3         3.3          6.0         2.5  virginica
    5          6.4         3.2          4.5         1.5 versicolor
    6          7.0         3.2          4.7         1.4 versicolor

    # Sort by Sepal.Length in descending order
    arrange(test, desc(Sepal.Length))
      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1          7.0         3.2          4.7         1.4 versicolor
    2          6.4         3.2          4.5         1.5 versicolor
    3          6.3         3.3          6.0         2.5  virginica
    4          5.8         2.7          5.1         1.9  virginica
    5          5.1         3.5          1.4         0.2     setosa
    6          4.9         3.0          1.4         0.2     setosa

    # De-duplicate the whole data frame by the Species column
    distinct(test, Species, .keep_all = TRUE)
      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1          5.1         3.5          1.4         0.2     setosa
    2          7.0         3.2          4.7         1.4 versicolor
    3          6.3         3.3          6.0         2.5  virginica

The pipe operator (%>%) passes the result of the expression on its left as the input of the function on its right, which lets you string a series of data-processing steps together. The shortcut in RStudio is Ctrl + Shift + M.
The following two statements have the same effect:
    head(test, 2)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa

    test %>% head(2)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa

Several pipes can also be chained in a row, which avoids deeply nested code.
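To make the point about nesting concrete, here is a minimal sketch that writes the same three steps twice, once as nested calls and once as a pipe chain. It only uses functions already shown in this post (mutate, arrange, desc, head); the variable names nested and piped are made up for illustration.

```r
library(dplyr)

# Same six-row subset of iris used earlier in the post
test <- iris[c(1:2, 51:52, 101:102), ]

# Nested form: has to be read from the innermost call outwards
nested <- head(arrange(mutate(test, new = Sepal.Length * Sepal.Width), desc(new)), 2)

# Piped form: the same steps, read top to bottom in the order they run
piped <- test %>%
  mutate(new = Sepal.Length * Sepal.Width) %>%
  arrange(desc(new)) %>%
  head(2)

# Both forms should give the same data frame
identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is where the readability gain comes from.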
    # Group by Species first, then compute the mean Sepal.Length of each group
    test %>%
      group_by(Species) %>%
      summarise(sepal_len_mean = mean(Sepal.Length))
    # A tibble: 3 × 2
      Species    sepal_len_mean
      <fct>               <dbl>
    1 setosa               5
    2 versicolor           6.7
    3 virginica            6.05

    library(tidyr)

    mat <- matrix(c(1, 4, 7, 10, 2, 5, 0.8, 11, 0.3, 6, 9, 12), nrow = 4,
                  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))
    test <- as.data.frame(mat)

    # Turn the row names into a regular column
    test <- tibble::rownames_to_column(test, "geneid")
    test
      geneid sample1 sample2 sample3
    1  gene1       1     2.0     0.3
    2  gene2       4     5.0     6.0
    3  gene3       7     0.8     9.0
    4  gene4      10    11.0    12.0

Use pivot_longer to convert wide format to long format, so that the sample names land in one column and the expression values in another. This layout is well suited to plotting with ggplot2.
    test1 <- pivot_longer(data = test,
                          cols = -geneid,
                          names_to = "sample_nm",
                          values_to = "exp")
    head(test1)
    # A tibble: 6 × 3
      geneid sample_nm   exp
      <chr>  <chr>     <dbl>
    1 gene1  sample1     1
    2 gene1  sample2     2
    3 gene1  sample3     0.3
    4 gene2  sample1     4
    5 gene2  sample2     5
    6 gene2  sample3     6

Use pivot_wider to convert long format back to wide format. Wide data is more intuitive to read and is well suited to drawing heatmaps.
    test2 <- pivot_wider(data = test1,
                         names_from = sample_nm,
                         values_from = exp)
    test2
    # A tibble: 4 × 4
      geneid sample1 sample2 sample3
      <chr>    <dbl>   <dbl>   <dbl>
    1 gene1        1     2       0.3
    2 gene2        4     5       6
    3 gene3        7     0.8     9
    4 gene4       10    11      12
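As a small sketch of how the pieces fit together, the long format produced by pivot_longer can be piped straight into group_by and summarise, here to get the mean expression per gene. This is only an illustration built on the toy matrix above; the names expr_wide, gene_means, and mean_exp are invented for the example and are not from the original post.

```r
library(dplyr)
library(tidyr)

# Rebuild the small expression table used above
mat <- matrix(c(1, 4, 7, 10, 2, 5, 0.8, 11, 0.3, 6, 9, 12), nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))
expr_wide <- tibble::rownames_to_column(as.data.frame(mat), "geneid")

# Wide -> long, then one summary row per gene
gene_means <- expr_wide %>%
  pivot_longer(cols = -geneid, names_to = "sample_nm", values_to = "exp") %>%
  group_by(geneid) %>%
  summarise(mean_exp = mean(exp))

gene_means
```

Once the values sit in a single column, one group_by call is enough to summarise by gene or by sample, without having to name each sample column.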