Pipeline integration (targets / drake)
Source:vignettes/v06_pipeline_integration.Rmd
v06_pipeline_integration.RmdPipeline integration
vennDiagramLab is library-first and tidyverse-friendly.
The broom-compatible S3 methods on
RegionResult make it trivial to plug into
targets / drake workflows or any pipeline that
expects tidy data.
library(vennDiagramLab)
result <- analyze(load_sample("dataset_real_cancer_drivers_4"))broom methods
Three methods convert a RegionResult to a tibble at
three different levels of aggregation:
-
tidy(result)— one row per set pair, all five pairwise metrics -
glance(result)— one row, headline numbers -
augment(result)— one row per item, set-membership flags + region label
broom::glance(result)
#> # A tibble: 1 × 8
#> n_sets n_regions n_items universe_size model is_approximate n_significant
#> <int> <int> <int> <int> <chr> <lgl> <int>
#> 1 4 15 20000 20000 venn-4-set FALSE 6
#> # ℹ 1 more variable: n_highly_significant <int>
head(broom::tidy(result))
#> # A tibble: 6 × 12
#> set_a set_b intersection expected jaccard dice overlap_coefficient
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 COSMIC_CGC OncoKB 581 35.8 0.472 0.641 1
#> 2 COSMIC_CGC IntOGen 388 18.4 0.470 0.639 0.668
#> 3 OncoKB IntOGen 477 39.0 0.344 0.512 0.754
#> 4 Vogelstein COSMIC_CGC 126 4.01 0.212 0.350 0.913
#> 5 Vogelstein IntOGen 123 4.37 0.190 0.319 0.891
#> 6 Vogelstein OncoKB 131 8.49 0.106 0.191 0.949
#> # ℹ 5 more variables: fold_enrichment <dbl>, p_value <dbl>, p_adjusted <dbl>,
#> # significant <lgl>, highly_significant <lgl>
head(broom::augment(result))
#> # A tibble: 6 × 6
#> item Vogelstein COSMIC_CGC OncoKB IntOGen region_label
#> <chr> <lgl> <lgl> <lgl> <lgl> <chr>
#> 1 A1CF FALSE FALSE TRUE FALSE C
#> 2 AAMP FALSE FALSE TRUE FALSE C
#> 3 ABCB1 FALSE FALSE TRUE FALSE C
#> 4 ABCC3 FALSE FALSE TRUE FALSE C
#> 5 ABCC4 FALSE FALSE FALSE TRUE D
#> 6 ABI1 FALSE TRUE TRUE FALSE BCCombining with dplyr
If you want to filter to only the highly significant pairs:
broom::tidy(result) |>
dplyr::filter(highly_significant) |>
dplyr::arrange(dplyr::desc(jaccard)) |>
dplyr::select(set_a, set_b, intersection, jaccard, p_adjusted)
#> # A tibble: 6 × 5
#> set_a set_b intersection jaccard p_adjusted
#> <chr> <chr> <int> <dbl> <dbl>
#> 1 COSMIC_CGC OncoKB 581 0.472 0
#> 2 COSMIC_CGC IntOGen 388 0.470 0
#> 3 OncoKB IntOGen 477 0.344 0
#> 4 Vogelstein COSMIC_CGC 126 0.212 1.01e-183
#> 5 Vogelstein IntOGen 123 0.190 5.54e-171
#> 6 Vogelstein OncoKB 131 0.106 3.13e-151Or count items per region:
targets pipeline (sketch)
A simple _targets.R file:
library(targets)
list(
tar_target(ds, load_sample("dataset_real_cancer_drivers_4")),
tar_target(result, analyze(ds)),
tar_target(stats_df, broom::tidy(result)),
tar_target(genes_df, broom::augment(result)),
tar_target(venn_svg, render_venn_svg(result)),
tar_target(venn_path,
{ writeLines(venn_svg, "venn.svg"); "venn.svg" },
format = "file")
)Run with targets::tar_make(). Each step caches
independently, so re-running after only changing the sort order in a
downstream report does not re-run the analysis.
Caching tip
statistics(result) recomputes on every call (no S4
lazy-property equivalent). If you call it many times, cache it once:
stats <- statistics(result)
str(stats@jaccard, max.level = 1)
#> num [1:4, 1:4] 1 0.212 0.106 0.19 0.212 ...
#> - attr(*, "dimnames")=List of 2Inside a targets pipeline, this is a non-issue because
tar_target(stats, statistics(result)) caches it for
you.
What’s next
-
vignette("v05_statistics_deep_dive")— what the metrics inbroom::tidy()actually mean. -
vignette("v07_pdf_reports")— turning a result into a PDF artifact for a pipeline.