--- title: "Data Resource" author: "Peter Desmet" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Data Resource} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` [Data Resource](https://specs.frictionlessdata.io/data-resource/) is a simple format to describe a data resource such as an individual table or file, including its name, format, path, etc. ::: {.callout-info} In this document we use the terms "package" for Data Package, "resource" for Data Resource, "dialect" for Table Dialect, and "schema" for Table Schema. ::: ## General implementation Frictionless supports reading, manipulating and writing resources, but much of its functionality is limited to [Tabular Data Resources](https://specs.frictionlessdata.io/tabular-data-resource/). ### Read `resources()` lists all resources in a package: ```{r} library(frictionless) package <- example_package() # List the resources resources(package) ``` `read_resource()` reads data from a tabular resource to a data frame: ```{r} read_resource(package, "deployments") ``` Frictionless does not support reading data from non-tabular resources. ### Manipulate `remove_resource()` removes a resource (of any type): ```{r} remove_resource(package, "deployments") # This and many other functions return "package", which you can update with # package <- remove_resource(package, "deployments") ``` `add_resource()` adds or replaces a tabular resource. The provided data must be a data frame or a tabular data file (e.g. CSV): ```{r} # Add a resource with data from a data frame add_resource(package, "iris", data = iris) # Replace a resource with one where data is stored in a tabular file path <- system.file("extdata", "v1", "deployments.csv", package = "frictionless") add_resource(package, "deployments", data = path, replace = TRUE) ``` ::: {.callout-info} You can pipe most functions (see `vignette("data-package")`). ::: ### Write `write_package()` writes a package to disk as a `datapackage.json` file. This file includes the metadata of all the resources. `write_package()` also writes resource data to CSV files, unless the referred data are referred to be URL or inline. See the function documentation for details. ## Properties implementation ### name [`name`](https://specs.frictionlessdata.io/data-resource/#name) is required. It is used to identify a resource in `read_resource()`, `add_resource()` and `remove_resource()` (always as the second argument): ```{r} deployments <- read_resource(package, resource_name = "deployments") ``` `add_resource()` sets `name` to the provided `resource_name`: ```{r} add_resource(package, resource_name = "iris", data = iris) ``` ### path [`path`](https://specs.frictionlessdata.io/data-resource/#data-location) or `data` (see further) is required. Providing both is not allowed. `path` is for data in files (e.g. a CSV file). It can be a local path or URL. Supported protocols are `http`, `https`, `ftp`, `sftp` and `sftp`. Absolute paths (`/`) or relative parent paths (`../`) are not allowed to avoid security vulnerabilities. When multiple paths are provided (`"path": ["myfile1.csv", "myfile2.csv"]`), the files are expected to have the same structure. `read_resource()` merges these into a single data frame in the order the paths are provided (using `dplyr::bind_rows()`): ```{r} # The "observations" resource has multiple files in path package$resources[[2]]$path # These are combined into a single data frame when reading read_resource(package, "observations") ``` `add_resource()` sets `path` to the path(s) provided in `data`: ```{r} path <- system.file("extdata", "v1", "deployments.csv", package = "frictionless") add_resource(package, "deployments", data = path, replace = TRUE) ``` ### data ::: {.callout-warning} Support for inline `data` is currently limited, e.g. JSON object and string are not supported and `schema`, `mediatype` and `format` are ignored. ::: `data` is for inline data (included in the `datapackage.json`). `read_resource()` attempts to read `data` if it is provided as a JSON array: ```{r} # The "media" resource has inline data str(package$resources[[3]]$data) read_resource(package, "media") ``` `add_resource()` adds the provided data frame to `data`: ```{r} df <- data.frame("col_1" = c(1, 2), "col_2" = c("a", "b")) package <- add_resource(package, "df", df) package$resources[[4]]$data ``` `write_package()` writes that data frame to a CSV file, adds its path to `path` and removes `data`. ### profile [`profile`](https://specs.frictionlessdata.io/data-resource/#profile) is required to have the value `"tabular-data-resource"`. `add_resource()` sets `profile` to that value. ### schema [`schema`](https://specs.frictionlessdata.io/data-resource/#resource-schemas) is required. It is used by `read_resource()` to parse data types and missing values. It can either be a JSON object or a path or URL referencing a JSON object. See `vignette("table-schema")` for details. ### dialect [`dialect`](https://specs.frictionlessdata.io/tabular-data-resource/#csv-dialect) is used by `read_resource()` to parse a tabular data file. It can either be a JSON object or a path or URL referencing a JSON object. See `vignette("table-dialect")` for details. ### title [`title`](https://specs.frictionlessdata.io/data-resource/#optional-properties) is ignored by `read_resource()` and not set by `add_resource()`, unless provided: ```{r} add_resource( package, "iris", iris, title = "Edgar Anderson's Iris Data", replace = TRUE ) ``` ### description [`description`](https://specs.frictionlessdata.io/data-resource/#optional-properties) is ignored by `read_resource()` and not set by `add_resource()` unless provided (cf. `title`). ### format [`format`](https://specs.frictionlessdata.io/data-resource/#optional-properties) is ignored by `read_resource()`. `add_resource()` sets `format` when data are provided as a file, based on the provided `delim`: delim | format --- | --- `","` (default) | `"csv"` `"\t"` | `"tsv"` any other value | `"csv"` ```{r} path <- system.file("extdata", "v1", "observations_1.tsv", package = "frictionless") package <- add_resource(package, "observations", data = path, delim = "\t", replace = TRUE) package$resources[[2]]$format ``` `add_resource()` leaves `format` undefined when data are provided as a data frame. `write_package()` sets it to `"csv"` when writing to disk. ### mediatype [`mediatype`](https://specs.frictionlessdata.io/data-resource/#optional-properties) is ignored by `read_resource()`. `add_resource()` sets `mediatype` when data are provided as a file, based on the provided `delim`: delim | mediatype --- | --- `","` (default) | `"text/csv"` `"\t"` | `"text/tab-separated-values"` any other value | `"text/csv"` ```{r} path <- system.file("extdata", "v1", "observations_1.tsv", package = "frictionless") package <- add_resource(package, "observations", data = path, delim = "\t", replace = TRUE) package$resources[[2]]$mediatype ``` `add_resource()` leaves `mediatype` undefined when data are provided as a data frame. `write_package()` sets it to `"text/csv"` when writing to disk. ### encoding [`encoding`](https://specs.frictionlessdata.io/data-resource/#optional-properties) (e.g. `"windows-1252"`) is used by `read_resource()` to parse the file. It defaults to UTF-8 if no `encoding` is provided or if it cannot be recognized. The returned data frame is always UTF-8. `add_resource()` guesses the `encoding` (using `readr::guess_encoding()`) when data are provided as file. It leaves the `encoding` undefined when data are provided as a data frame. `write_package()` sets it to `"utf-8"` when writing to disk. ```{r} path <- system.file("extdata", "v1", "deployments.csv", package = "frictionless") package <- add_resource(package, "deployments", data = path, delim = ",", replace = TRUE) package$resources[[2]]$encoding ``` ### bytes [`bytes`](https://specs.frictionlessdata.io/data-resource/#optional-properties) is ignored by `read_resource()` and not set by `add_resource()` unless provided (cf. `title`). ### hash [`hash`](https://specs.frictionlessdata.io/data-resource/#optional-properties) is ignored by `read_resource()` and not set by `add_resource()` unless provided (cf. `title`). ### sources [`sources`](https://specs.frictionlessdata.io/data-resource/#optional-properties) is ignored by `read_resource()` and not set by `add_resource()` unless provided (cf. `title`). ### licenses [`licenses`](https://specs.frictionlessdata.io/data-resource/#optional-properties) is ignored by `read_resource()` and not set by `add_resource()` unless provided (cf. `title`). ### compression [`compression`](https://specs.frictionlessdata.io/patterns/#specification-3) (a recipe) is ignored by `read_resource()` and not set by `add_resource()`. Compression is derived from the provided `path` instead. If the `path` ends in `.gz`, `.bz2`, `.xz`, or `.zip`, the files are automatically decompressed by `read_resource()` (using default `readr::read_delim()` functionality). Only `.gz` files can be read directly from URL `path`s.