<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://brycemecum.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://brycemecum.com/" rel="alternate" type="text/html" /><updated>2026-02-26T05:27:32+00:00</updated><id>https://brycemecum.com/feed.xml</id><title type="html">brycemecum.com</title><subtitle>The personal site of Bryce Mecum</subtitle><entry><title type="html">Using ADBC with $5 Planetscale</title><link href="https://brycemecum.com/2025/11/15/adbc-planetscale/" rel="alternate" type="text/html" title="Using ADBC with $5 Planetscale" /><published>2025-11-15T00:00:00+00:00</published><updated>2025-11-15T00:00:00+00:00</updated><id>https://brycemecum.com/2025/11/15/adbc-planetscale</id><content type="html" xml:base="https://brycemecum.com/2025/11/15/adbc-planetscale/"><![CDATA[<p><img src="/assets/adbc-planetscale/adbc-planetscale.png" alt="" /></p>

<p><a href="https://planetscale.com">Planetscale</a> recently <a href="https://bsky.app/profile/planetscale.com/post/3m5mjtfj3gs2r">announced</a> their new $5 Postgres instance (PS-5) and I wanted to give it a test.</p>

<p>Since I’m working on <a href="https://arrow.apache.org/adbc">ADBC</a>, my first question was whether Planetscale $5 Postgres would work with the <a href="https://arrow.apache.org/adbc/current/driver/postgresql.html">ADBC PostgreSQL driver</a>.</p>

<p>Here’s what I did:</p>

<p>After creating a new $5 instance, I clicked Connect and created a role with the following permissions:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pg_read_all_data</code> (Read data from all tables, views, and sequences.)</li>
  <li><code class="language-plaintext highlighter-rouge">pg_write_all_data</code> (Write data to all tables, views, and sequences.)</li>
  <li><code class="language-plaintext highlighter-rouge">postgres</code> (Create, modify, and drop databases, users, roles, tables, schemas, and all other objects.)</li>
</ul>

<p>Note: That last role (postgres) will be key since I want to test ingesting data into my instance.</p>

<p>When the Connect wizard asked me how I was connecting, I selected Python. At this point, the instructions show how to use the <code class="language-plaintext highlighter-rouge">psycopg2-binary</code> package and provide this code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">os</span>

<span class="kn">import</span> <span class="n">psycopg2</span>
<span class="kn">from</span> <span class="n">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span>

<span class="c1"># Load environment variables from the .env file
</span><span class="nf">load_dotenv</span><span class="p">()</span>

<span class="n">conn</span> <span class="o">=</span> <span class="n">psycopg2</span><span class="p">.</span><span class="nf">connect</span><span class="p">(</span>
  <span class="n">host</span><span class="o">=</span><span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">"</span><span class="s">DATABASE_HOST</span><span class="sh">"</span><span class="p">),</span>
  <span class="n">port</span><span class="o">=</span><span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">"</span><span class="s">DATABASE_PORT</span><span class="sh">"</span><span class="p">),</span>
  <span class="n">user</span><span class="o">=</span><span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">"</span><span class="s">DATABASE_USERNAME</span><span class="sh">"</span><span class="p">),</span>
  <span class="n">password</span><span class="o">=</span><span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">"</span><span class="s">DATABASE_PASSWORD</span><span class="sh">"</span><span class="p">),</span>
  <span class="n">dbname</span><span class="o">=</span><span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">"</span><span class="s">DATABASE</span><span class="sh">"</span><span class="p">),</span>
<span class="p">)</span>

<span class="n">cur</span> <span class="o">=</span> <span class="n">conn</span><span class="p">.</span><span class="nf">cursor</span><span class="p">()</span>
<span class="n">cur</span><span class="p">.</span><span class="nf">execute</span><span class="p">(</span><span class="sh">"</span><span class="s">SELECT version();</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">cur</span><span class="p">.</span><span class="nf">fetchone</span><span class="p">())</span>

<span class="n">cur</span><span class="p">.</span><span class="nf">close</span><span class="p">()</span>
<span class="n">conn</span><span class="p">.</span><span class="nf">close</span><span class="p">()</span>
</code></pre></div></div>

<p>Because ADBC drivers use the same underlying protocols as the databases they’re targeting:</p>

<ol>
  <li>We can swap <code class="language-plaintext highlighter-rouge">psycopg2-binary</code> out for the ADBC PostgreSQL driver and it should just work</li>
  <li>We can use mostly the same code and exactly the same SQL</li>
</ol>

<p>I installed the ADBC PostgreSQL driver using <a href="https://docs.columnar.tech/dbc">dbc</a>, a new command-line tool we’re building to make working with ADBC drivers easier. (The driver is also available <a href="https://pypi.org/project/adbc-driver-postgresql">on PyPI</a>.) I did this in a venv to keep it contained (using <a href="https://astral.sh/uv">uv</a>):</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>uv venv
<span class="gp">$</span><span class="w"> </span><span class="nb">source</span> .venv/bin/activate
<span class="gp">$</span><span class="w"> </span>dbc <span class="nb">install </span>postgresql
<span class="go">[✓] searching
[✓] downloading
[✓] installing
[✓] verifying signature

Installed postgresql 1.8.0 to /Users/bryce/planetscale-adbc/.venv/etc/adbc/drivers
</span></code></pre></div></div>

<p>This installed the driver into my new virtual environment as you can see above.</p>

<p>I then installed a few more packages for my test:</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>uv pip <span class="nb">install </span>adbc-driver-manager pyarrow
</code></pre></div></div>
<p>In its wizard, Planetscale gave me a set of environment variables for the connection, so I stored them in a <code class="language-plaintext highlighter-rouge">.env</code> file and loaded them into my shell (using <a href="https://direnv.net">direnv</a>) so they’d be available to Python below.</p>
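<p>For reference, the <code class="language-plaintext highlighter-rouge">.env</code> file looks something like this (the values below are placeholders, not real credentials; the wizard gives you the actual values):</p>

```shell
DATABASE_HOST=<your Planetscale host>
DATABASE_PORT=<your port>
DATABASE_USERNAME=<your username>
DATABASE_PASSWORD=<your password>
DATABASE=<your database name>
```

<p>With direnv, an <code class="language-plaintext highlighter-rouge">.envrc</code> containing the single line <code class="language-plaintext highlighter-rouge">dotenv</code> exports these whenever you enter the directory.</p>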

<p>To do my test, I connected:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">os</span>
<span class="kn">from</span> <span class="n">adbc_driver_manager</span> <span class="kn">import</span> <span class="n">dbapi</span>

<span class="n">URI</span><span class="o">=</span><span class="sa">f</span><span class="sh">"</span><span class="s">postgresql://</span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">'</span><span class="s">DATABASE_USERNAME</span><span class="sh">'</span><span class="p">)</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">'</span><span class="s">DATABASE_PASSWORD</span><span class="sh">'</span><span class="p">)</span><span class="si">}</span><span class="s">@</span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">'</span><span class="s">DATABASE_HOST</span><span class="sh">'</span><span class="p">)</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">'</span><span class="s">DATABASE_PORT</span><span class="sh">'</span><span class="p">)</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">'</span><span class="s">DATABASE</span><span class="sh">'</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span>

<span class="n">con</span> <span class="o">=</span> <span class="n">dbapi</span><span class="p">.</span><span class="nf">connect</span><span class="p">(</span><span class="n">driver</span><span class="o">=</span><span class="sh">"</span><span class="s">postgresql</span><span class="sh">"</span><span class="p">,</span> <span class="n">uri</span><span class="o">=</span><span class="n">URI</span><span class="p">)</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">con</span><span class="p">.</span><span class="nf">cursor</span><span class="p">()</span>
<span class="n">cur</span><span class="p">.</span><span class="nf">execute</span><span class="p">(</span><span class="sh">"</span><span class="s">SELECT version();</span><span class="sh">"</span><span class="p">).</span><span class="nf">fetchone</span><span class="p">()</span>
<span class="c1"># =&gt; ('PostgreSQL 17.5 (Debian 17.5-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit',)
</span></code></pre></div></div>

<p>Ingested a <a href="https://parquet.apache.org">Parquet</a> file (<a href="https://allisonhorst.github.io/palmerpenguins/">Palmer Penguins</a>, of course):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">pyarrow.parquet</span> <span class="k">as</span> <span class="n">pq</span>

<span class="n">tbl</span> <span class="o">=</span> <span class="n">pq</span><span class="p">.</span><span class="nf">read_table</span><span class="p">(</span><span class="sh">"</span><span class="s">./penguins.parquet</span><span class="sh">"</span><span class="p">)</span>
<span class="n">cur</span><span class="p">.</span><span class="nf">adbc_ingest</span><span class="p">(</span><span class="sh">"</span><span class="s">penguins</span><span class="sh">"</span><span class="p">,</span> <span class="n">tbl</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="sh">"</span><span class="s">create</span><span class="sh">"</span><span class="p">)</span>
<span class="c1"># =&gt; 344
</span></code></pre></div></div>

<p>And then read it back:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tbl</span> <span class="o">=</span> <span class="n">cur</span><span class="p">.</span><span class="nf">execute</span><span class="p">(</span><span class="sh">"</span><span class="s">select * from penguins</span><span class="sh">"</span><span class="p">).</span><span class="nf">fetch_arrow_table</span><span class="p">()</span>
<span class="n">tbl</span><span class="p">.</span><span class="n">num_rows</span>
<span class="c1"># =&gt; 344
</span><span class="n">tbl</span>
</code></pre></div></div>

<p>Which prints:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pyarrow.Table
species: string
island: string
bill_len: double
bill_dep: double
flipper_len: int64
body_mass: int64
sex: string
year: int64
----
species: [["Adelie","Adelie","Adelie","Adelie","Adelie",...,"Chinstrap","Chinstrap","Chinstrap","Chinstrap","Chinstrap"]]
island: [["Torgersen","Torgersen","Torgersen","Torgersen","Torgersen",...,"Dream","Dream","Dream","Dream","Dream"]]
bill_len: [[39.1,39.5,40.3,null,36.7,...,55.8,43.5,49.6,50.8,50.2]]
bill_dep: [[18.7,17.4,18,null,19.3,...,19.8,18.1,18.2,19,18.7]]
flipper_len: [[181,186,195,null,193,...,207,202,193,210,198]]
body_mass: [[3750,3800,3250,null,3450,...,4000,3400,3775,4100,3775]]
sex: [["male","female","female","NA","female",...,"male","female","male","male","female"]]
year: [[2007,2007,2007,2007,2007,...,2009,2009,2009,2009,2009]]
</code></pre></div></div>

<p>I’d say that was a successful test.</p>

<p>Now, connecting to a $5 PostgreSQL instance may not be the most realistic demonstration of how to use ADBC, but I hope the above shows the value of interfaces. Notably, here are the things I didn’t have to do:</p>

<ol>
  <li>Learn a new Python database API (both psycopg2 and ADBC speak <a href="https://peps.python.org/pep-0249/">PEP 249</a>)</li>
  <li>Figure out how to get my Parquet data converted into whatever format PostgreSQL needs</li>
</ol>

<p>And, also because of interfaces, here’s what I can now easily do:</p>

<ol>
  <li>Work with the data directly in <a href="https://arrow.apache.org/docs/python">PyArrow</a></li>
  <li>Work with this data in <a href="https://duckdb.org">DuckDB</a> without copying</li>
  <li>Work with this data in <a href="https://pola.rs">Polars</a> without copying</li>
</ol>

<p>And the list can go on because ADBC speaks <a href="https://arrow.apache.org">Arrow</a> and increasing amounts of the data engineering stack are speaking Arrow.</p>]]></content><author><name></name></author><category term="arrow" /><category term="adbc" /><category term="postgresql" /><category term="database" /><category term="planetscale" /><category term="python" /><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brycemecum.com/assets/images/og/posts/adbc-planetscale.png" /><media:content medium="image" url="https://brycemecum.com/assets/images/og/posts/adbc-planetscale.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">TIL: Mermaid Gantt diagrams are great for displaying distributed traces in Markdown</title><link href="https://brycemecum.com/2023/03/31/til-mermaid-tracing/" rel="alternate" type="text/html" title="TIL: Mermaid Gantt diagrams are great for displaying distributed traces in Markdown" /><published>2023-03-31T00:00:00+00:00</published><updated>2023-03-31T00:00:00+00:00</updated><id>https://brycemecum.com/2023/03/31/til-mermaid-tracing</id><content type="html" xml:base="https://brycemecum.com/2023/03/31/til-mermaid-tracing/"><![CDATA[<p>Today I noticed via <a href="https://twitter.com/mitsuhiko/status/1641040644121436160">a tweet</a> by <a href="https://twitter.com/mitsuhiko">@mitsuhiko</a> that <a href="https://mermaid.js.org/">Mermaid</a> Gantt diagrams are great for displaying distributed trace information like what you’d get from <a href="https://github.com/jaegertracing/jaeger-ui">JaegerUI</a>.
I’ve been working with <a href="https://opentelemetry.io">OpenTelemetry</a> a fair bit recently and, in recent projects, I’ve been including screenshots of JaegerUI whenever I need to show a distributed trace in my documentation.
This generally works fine but I’m happy to have an alternative that’s more at home in Markdown and on the web.</p>

<p>If you’re not familiar with Mermaid, they have <a href="https://mermaid.js.org/intro/">great docs</a>.</p>

<h2 id="gantt-diagrams">Gantt Diagrams</h2>

<p>Gantt diagrams are typically used for scheduling multiple tasks along a shared timeline. In hindsight, it makes total sense to reach for a Gantt diagram when diagramming a distributed trace.</p>

<p>The Mermaid syntax for a pretty typical Gantt looks like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gantt
    title A Gantt Diagram
    dateFormat  YYYY-MM-DD
    section Section
    A task             : a1, 2014-01-01, 30d
    Another task       : after a1  , 20d
    section Another
    Task in sec        : 2014-01-12  , 12d
    another task       : 24d
</code></pre></div></div>

<p>and, when rendered, looks like:</p>

<script type="module">import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';</script>
<div class="mermaid">
gantt
    title A Gantt Diagram
    dateFormat  YYYY-MM-DD
    section Section
    A task             : a1, 2014-01-01, 30d
    Another task       : after a1  , 20d
    section Another
    Task in sec        : 2014-01-12  , 12d
    another task       : 24d
</div>

<h2 id="a-basic-trace-diagram">A Basic Trace Diagram</h2>

<p>The <a href="https://twitter.com/mitsuhiko/status/1641040644121436160">tweet</a> I mentioned previously shows code for a Gantt diagram of a simple trace:</p>

<script type="module">import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';</script>
<div class="mermaid">
gantt
    title Trace Showing Attached and Detached Spans
    dateFormat x
    axisFormat %S.%L

    section Frontend
    /checkout               :crit, 0, 500ms
    App                     :300, 180ms
    POST /api/analytics     :done, 450, 70ms
    GET /assistant/poll     :done, 450, 120ms
    POST /api/analytics     :done, 580, 70ms
</div>

<p>The code for which is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gantt
    title Trace Showing Attached and Detached Spans
    dateFormat x
    axisFormat %S.%L

    section Frontend
    /checkout               :crit, 0, 500ms
    App                     :300, 180ms
    POST /api/analytics     :done, 450, 70ms
    GET /assistant/poll     :done, 450, 120ms
    POST /api/analytics     :done, 580, 70ms
</code></pre></div></div>

<p>To do this, the code above uses just a few features of Mermaid’s <a href="https://mermaid.js.org/syntax/gantt.html">Gantt syntax</a> to make the diagram look less like a typical Gantt diagram and more like an OpenTelemetry trace:</p>

<ol>
  <li>To show everything on a time scale instead of a calendar date scale:
    <ul>
      <li>Specify a <code class="language-plaintext highlighter-rouge">dateFormat</code> of <code class="language-plaintext highlighter-rouge">x</code> (milliseconds) instead of the usual <code class="language-plaintext highlighter-rouge">YYYY-MM-DD</code></li>
      <li>Specify an <code class="language-plaintext highlighter-rouge">axisFormat</code> of <code class="language-plaintext highlighter-rouge">%S.%L</code> which makes the chart use seconds with milliseconds instead of dates</li>
    </ul>
  </li>
  <li>Separate each service into its own <code class="language-plaintext highlighter-rouge">section</code></li>
  <li>Visually distinguish spans using tags like <code class="language-plaintext highlighter-rouge">:crit</code> and <code class="language-plaintext highlighter-rouge">:done</code>, which apply styling by default</li>
</ol>

<h2 id="a-more-realistic-example">A More Realistic Example</h2>

<p><a href="https://twitter.com/mitsuhiko">@mitsuhiko</a> also linked to a <a href="http://sentry.io">Sentry</a> <a href="https://github.com/getsentry/rfcs/blob/7e215e6e8fd54f8adec9f7dc0ed3505d76540717/text/0083-starfish-tracing-model.md#trace">RFC</a> that’s in the works with a more representative example:</p>

<script type="module">import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';</script>
<div class="mermaid">
gantt
    title Example Starfish Trace
    dateFormat x
    axisFormat %S.%L

    section Frontend
    /checkout                                        :crit, 0, 1500ms
    GET /api/session                                 :150, 170ms
    POST /api/analytics                              :190, 70ms
    GET /api/checkout/state                          :200, 500ms
    GET /api/checkout/cart                           :1100, 140ms
    App                                          :1300, 180ms
    POST /api/analytics                              :done, 1450, 70ms
    GET /assistant/poll                              :done, 1450, 120ms
    POST /api/analytics                              :done, 1580, 70ms

    section API Service
    /api/checkout/state                              :crit, 240, 440ms
    cache.get session#58;[redacted]                  :360, 10ms
    db.query select from session                     :370, 20ms
    db.query select from user                        :390, 20ms
    db.query select from checkout                    :410, 20ms
    http.request GET http#58;//payments/poll  :450, 210ms
    thread.spawn refresh-checkout-cache              :done, 670, 220ms

    section Payment Service
    /poll                                            :crit, 470, 180ms
    db.query select from payment                     :490, 30ms
    db.query update payment                          :530, 60ms
</div>

<p>which has the following code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gantt
    title Example Starfish Trace
    dateFormat x
    axisFormat %S.%L

    section Frontend
    /checkout                                        :crit, 0, 1500ms
    GET /api/session                                 :150, 170ms
    POST /api/analytics                              :190, 70ms
    GET /api/checkout/state                          :200, 500ms
    GET /api/checkout/cart                           :1100, 140ms
    App                                              :1300, 180ms
    POST /api/analytics                              :done, 1450, 70ms
    GET /assistant/poll                              :done, 1450, 120ms
    POST /api/analytics                              :done, 1580, 70ms

    section API Service
    /api/checkout/state                              :crit, 240, 440ms
    cache.get session#58;[redacted]                  :360, 10ms
    db.query select from session                     :370, 20ms
    db.query select from user                        :390, 20ms
    db.query select from checkout                    :410, 20ms
    http.request GET http#58;//payments/poll  :450, 210ms
    thread.spawn refresh-checkout-cache              :done, 670, 220ms

    section Payment Service
    /poll                                            :crit, 470, 180ms
    db.query select from payment                     :490, 30ms
    db.query update payment                          :530, 60ms
</code></pre></div></div>]]></content><author><name></name></author><category term="til" /><category term="jekyll" /><category term="mermaidjs" /><category term="opentelemetry" /><category term="tracing" /><summary type="html"><![CDATA[Today I noticed via a tweet by @mitsuhiko that Mermaid Gantt diagrams are great for displaying distributed trace information like what you’d get from JaegerUI. I’ve been working with OpenTelemetry a fair bit recently and, in recent projects, I’ve been including screenshots of JaegerUI whenever I need to show a distributed trace in my documentation. This generally works fine but I’m happy to have an alternative that’s more at home in Markdown and on the web.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brycemecum.com/assets/images/og/posts/til-mermaid-tracing.png" /><media:content medium="image" url="https://brycemecum.com/assets/images/og/posts/til-mermaid-tracing.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Reflecting on Apache Arrow in 2022</title><link href="https://brycemecum.com/2023/02/06/reflecting-on-apache-arrow-in-2022/" rel="alternate" type="text/html" title="Reflecting on Apache Arrow in 2022" /><published>2023-02-06T00:00:00+00:00</published><updated>2023-02-06T00:00:00+00:00</updated><id>https://brycemecum.com/2023/02/06/reflecting-on-apache-arrow-in-2022</id><content type="html" xml:base="https://brycemecum.com/2023/02/06/reflecting-on-apache-arrow-in-2022/"><![CDATA[<p>In <a href="https://www.datawill.io/posts/apache-arrow-2022-reflection/">Reflecting on Apache Arrow in 2022</a>, <a href="https://www.datawill.io">Will Jones</a> does a really nice job providing a history of the <a href="https://arrow.apache.org">Apache Arrow</a> project and the broader ecosystem it was originally created to help foster.
It’s worth a read in full.</p>

<p>In his post, he describes the C++ Arrow ecosystem as somewhat fractured and suggests this may stem primarily from other teams’ need to move fast, though he points out it may also have something to do with libarrow’s attractiveness as a dependency.</p>

<p>One quote that jumped out at me as particularly insightful is this one:</p>

<blockquote>
  <p>Yet those are all the same challenges our users experience; would it not be better if we felt those pains ourselves and had incentive to address them? I tend to think we would design better public APIs if we had to use them ourselves for our own query engine. <a href="https://www.datawill.io/posts/apache-arrow-2022-reflection/#who-is-libarrows-and-aceros-audience:~:text=Yet%20those%20are%20all%20the%20same%20challenges%20our%20users%20experience%3B%20would%20it%20not%20be%20better%20if%20we%20felt%20those%20pains%20ourselves%20and%20had%20incentive%20to%20address%20them%3F%20I%20tend%20to%20think%20we%20would%20design%20better%20public%20APIs%20if%20we%20had%20to%20use%20them%20ourselves%20for%20our%20own%20query%20engine.">#</a></p>
</blockquote>

<p>This immediately reminded me of something I think <a href="https://jennybryan.org/">Jenny Bryan</a> said (which I cannot currently find) about doing the hard things often so they aren’t hard anymore.
If integrating parts of the Arrow ecosystem with each other is hard for members of the Arrow project, it’s likely to be considerably harder for those outside of it, and I look forward to watching work on this front progress in 2023.</p>]]></content><author><name></name></author><category term="apache-arrow" /><summary type="html"><![CDATA[In Reflecting on Apache Arrow in 2022, Will Jones does a really nice job providing a history of the Apache Arrow project and the broader ecosystem it was originally created to help foster. It’s worth a read in full.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brycemecum.com/assets/images/og/posts/reflecting-on-apache-arrow-in-2022.png" /><media:content medium="image" url="https://brycemecum.com/assets/images/og/posts/reflecting-on-apache-arrow-in-2022.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Writing QuickLook plugins in Go</title><link href="https://brycemecum.com/2022/09/19/quicklook-go/" rel="alternate" type="text/html" title="Writing QuickLook plugins in Go" /><published>2022-09-19T00:00:00+00:00</published><updated>2022-09-19T00:00:00+00:00</updated><id>https://brycemecum.com/2022/09/19/quicklook-go</id><content type="html" xml:base="https://brycemecum.com/2022/09/19/quicklook-go/"><![CDATA[<p>I recently wanted to write a <a href="https://support.apple.com/guide/mac-help/view-and-edit-files-with-quick-look-mh14119/mac">QuickLook</a> plugin for <a href="https://parquet.apache.org/">Apache Parquet</a> because I’m starting to use it more and more.
There are some neat third-party plugins out there like <a href="https://github.com/toland/qlmarkdown">QLMarkdown</a> and <a href="https://github.com/whomwah/qlstephen">QLStephen</a> so I sat down to figure out how to write my own.</p>

<p>My first questions were which programming language I’d have to use and how I could write as little new code as possible.</p>

<p>I first tried to vendor <a href="https://github.com/apache/arrow/tree/master/cpp">libarrow</a> and link against that but ran into issues making clang happy with the C++17 stdlib (which libarrow targets).
I have a feeling it could be made to work but I went back to the web and found a <a href="https://github.com/remko/qlmka">neat project</a> that used <a href="https://go.dev/">Go</a> for the plugin code.
The <a href="https://github.com/apache/arrow/tree/master/go">Go Arrow implementation</a> happens to be one of the few that is written natively (rather than implementing as a binding to libarrow) so I gave that a shot.</p>

<h2 id="making-a-new-xcode-project">Making a New Xcode Project</h2>

<p>To start out, I wasn’t able to figure out how to make Xcode create a new QuickLook plugin from scratch, so I ended up adapting from <a href="https://github.com/toland/qlmarkdown">QLMarkdown</a>.
The important files seemed to be:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">main.c</code>: Entrypoint for the plugin. Mostly boilerplate aside from the GUID.</li>
  <li><code class="language-plaintext highlighter-rouge">GeneratePreviewForURL.m</code>: Definition and implementation of code for generating our QuickLook preview. This is what I cared about most.</li>
  <li><code class="language-plaintext highlighter-rouge">GenerateThumbnailForURL.m</code>: Definition and implementation of code for generating thumbnails for our files. Not used here.</li>
</ul>

<p>The core bit on the Xcode side is essentially the implementation in <code class="language-plaintext highlighter-rouge">GeneratePreviewForURL.m</code>, which implements what looks like a fairly reasonable interface for getting data back to macOS for displaying the preview:</p>

<div class="language-objectivec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">OSStatus</span> <span class="nf">GeneratePreviewForURL</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">thisInterface</span><span class="p">,</span> <span class="n">QLPreviewRequestRef</span> <span class="n">preview</span><span class="p">,</span> <span class="n">CFURLRef</span> <span class="n">url</span><span class="p">,</span>
                               <span class="n">CFStringRef</span> <span class="n">contentTypeUTI</span><span class="p">,</span> <span class="n">CFDictionaryRef</span> <span class="n">options</span><span class="p">)</span> <span class="p">{</span>

  <span class="n">NSString</span> <span class="o">*</span><span class="n">content</span> <span class="o">=</span> <span class="n">MyFun</span><span class="p">((</span><span class="n">__bridge</span> <span class="n">NSURL</span> <span class="o">*</span><span class="p">)</span><span class="n">url</span><span class="p">);</span>

  <span class="n">CFDictionaryRef</span> <span class="n">previewProperties</span> <span class="o">=</span> <span class="p">(</span><span class="n">__bridge</span> <span class="n">CFDictionaryRef</span><span class="p">)</span> <span class="p">@{</span>
    <span class="p">(</span><span class="n">__bridge</span> <span class="n">NSString</span> <span class="o">*</span><span class="p">)</span><span class="n">kQLPreviewPropertyTextEncodingNameKey</span> <span class="o">:</span> <span class="s">@"UTF-8"</span><span class="p">,</span>
    <span class="p">(</span><span class="n">__bridge</span> <span class="n">NSString</span> <span class="o">*</span><span class="p">)</span><span class="n">kQLPreviewPropertyMIMETypeKey</span> <span class="o">:</span> <span class="s">@"text/html"</span><span class="p">,</span>
  <span class="p">};</span>

  <span class="n">QLPreviewRequestSetDataRepresentation</span><span class="p">(</span><span class="n">preview</span><span class="p">,</span> <span class="p">(</span><span class="n">__bridge</span> <span class="n">CFDataRef</span><span class="p">)[</span><span class="n">content</span> <span class="nf">dataUsingEncoding</span><span class="p">:</span><span class="n">NSUTF8StringEncoding</span><span class="p">],</span>
                                        <span class="n">kUTTypeHTML</span><span class="p">,</span> <span class="n">previewProperties</span><span class="p">);</span>

  <span class="k">return</span> <span class="n">noErr</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So basically I needed to write a function that:</p>

<ol>
  <li>Takes a filepath (URL)</li>
  <li>Returns a string</li>
</ol>

<h2 id="writing-the-go-portion">Writing the Go Portion</h2>

<p>Thanks to <a href="https://go.dev/blog/cgo">CGo</a>, we can easily work with Go and C code at the same time, which (I think) is why all of this works so well.</p>

<p>A basic skeleton for the Go code looks like this:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">internal</span>

<span class="k">import</span> <span class="p">(</span>
	<span class="s">"bytes"</span>
	<span class="s">"unsafe"</span>
	<span class="c">// ... plus your other packages here</span>
<span class="p">)</span>

<span class="k">import</span> <span class="s">"C"</span>

<span class="c">//export MyFun</span>
<span class="k">func</span> <span class="n">MyFun</span><span class="p">(</span><span class="n">cpath</span> <span class="o">*</span><span class="n">C</span><span class="o">.</span><span class="n">char</span><span class="p">)</span> <span class="p">(</span><span class="n">code</span> <span class="n">C</span><span class="o">.</span><span class="kt">int</span><span class="p">,</span> <span class="n">outData</span> <span class="n">unsafe</span><span class="o">.</span><span class="n">Pointer</span><span class="p">,</span> <span class="n">outLen</span> <span class="n">C</span><span class="o">.</span><span class="n">long</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">path</span> <span class="o">:=</span> <span class="n">C</span><span class="o">.</span><span class="n">GoString</span><span class="p">(</span><span class="n">cpath</span><span class="p">)</span>

	<span class="k">var</span> <span class="n">buf</span> <span class="n">bytes</span><span class="o">.</span><span class="n">Buffer</span>

    <span class="c">// Now just write data into `buf`</span>

	<span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="n">C</span><span class="o">.</span><span class="n">CBytes</span><span class="p">(</span><span class="n">buf</span><span class="o">.</span><span class="n">Bytes</span><span class="p">()),</span> <span class="n">C</span><span class="o">.</span><span class="n">long</span><span class="p">(</span><span class="n">buf</span><span class="o">.</span><span class="n">Len</span><span class="p">())</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A couple of things to note:</p>

<ol>
  <li>The <code class="language-plaintext highlighter-rouge">import "C"</code> is key here</li>
  <li>The <code class="language-plaintext highlighter-rouge">//export MyFun</code> comment is required: with <code class="language-plaintext highlighter-rouge">-buildmode=c-archive</code>, only functions marked with a cgo <code class="language-plaintext highlighter-rouge">//export</code> comment are callable from C</li>
  <li>The function you write just needs to write into <code class="language-plaintext highlighter-rouge">buf</code> which is pretty straightforward in Go</li>
</ol>

<p>Last, to compile our Go module into something we can tell Xcode to link against, we do something I’d never done before with Go:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>go build <span class="nt">-buildmode</span><span class="o">=</span>c-archive <span class="nt">-o</span> internal.a ./internal
</code></pre></div></div>

<p>The above produces <code class="language-plaintext highlighter-rouge">internal.a</code> (plus a matching C header, <code class="language-plaintext highlighter-rouge">internal.h</code>), which is critical for the next step.</p>

<h2 id="bringing-both-sides-together">Bringing Both Sides Together</h2>

<p>To tell Xcode to compile our Objective-C code and link against <code class="language-plaintext highlighter-rouge">internal.a</code>, we need to add the archive in Xcode under Build Phases &gt; Link Binary With Libraries.</p>

<p>Depending on what Go code you end up writing, you may also need to add various <code class="language-plaintext highlighter-rouge">.framework</code>s until linking succeeds.
One surprising thing I ran into was that using Go’s <code class="language-plaintext highlighter-rouge">html/template</code> package required <code class="language-plaintext highlighter-rouge">Security.framework</code>.
In total, I ended up linking against:</p>

<p><img src="/assets/quicklook-go/linking.png" alt="Screenshot of Apple Xcode showing a user interface of a list of items, headed by the text &quot;Link Binary With Libraries (8 Items)&quot;" /></p>

<h2 id="wrapping-up">Wrapping Up</h2>

<p>Once built, you can move the result into <code class="language-plaintext highlighter-rouge">~/Library/QuickLook</code>.
You may have to run <code class="language-plaintext highlighter-rouge">qlmanage -r</code> and even preview other files to get the new previews picked up.
Overall, this is a bit finicky and I wish double-clicking a <code class="language-plaintext highlighter-rouge">*.qlgenerator</code> file just prompted you to install it and handled the caches for you.</p>

<p>In the end, my preview for Parquet ended up looking like this:</p>

<p><img src="/assets/quicklook-go/qlarrow-example.png" alt="Screenshot of a QuickLook preview dialog showing a summary of a file named orders_0.1.parquet" /></p>

<p>I put the full source code for my plugin at <a href="https://github.com/amoeba/QLArrow">QLArrow</a></p>]]></content><author><name></name></author><category term="til" /><category term="macos" /><category term="quicklook" /><category term="golang" /><category term="apache-arrow" /><category term="parquet" /><summary type="html"><![CDATA[I recently wanted to write a QuickLook plugin for Apache Parquet because I’m starting to use it more and more. There are some neat third-party plugins out there like QLMarkdown and QLStephen so I sat down to figure out how to write my own.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brycemecum.com/assets/images/og/posts/quicklook-go.png" /><media:content medium="image" url="https://brycemecum.com/assets/images/og/posts/quicklook-go.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Flatgeobuf</title><link href="https://brycemecum.com/2022/04/04/flatgeobuf/" rel="alternate" type="text/html" title="Flatgeobuf" /><published>2022-04-04T00:00:00+00:00</published><updated>2022-04-04T00:00:00+00:00</updated><id>https://brycemecum.com/2022/04/04/flatgeobuf</id><content type="html" xml:base="https://brycemecum.com/2022/04/04/flatgeobuf/"><![CDATA[<p>I recently came across <a href="https://flatgeobuf.org/">Flatgeobuf</a> and it looks like a really neat project.
An <a href="https://observablehq.com/@bjornharrtell/streaming-flatgeobuf">Observable Notebook</a> by its creator shows an example of progressively rendering polygons for all counties in the US and it got me thinking about how to apply it elsewhere.</p>

<p><img src="/assets/flatgeobuf/counties-animation.gif" alt="animation showing the counties of the United States being drawn as polygons in a seemingly random order" /></p>

<p>The key of that <a href="https://observablehq.com/@bjornharrtell/streaming-flatgeobuf">demo</a> — and something that takes advantage of how the Flatgeobuf format is designed — is the <a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream">ReadableStream</a>.
Instead of having to read the entire file (14.1MB in the above demo) before starting to render our map, we can begin rendering essentially as soon as the first feature is read over the stream and continue rendering, feature-by-feature, as more features are loaded.</p>

<p>That’s accomplished with the following async code:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">response</span> <span class="o">=</span> <span class="k">await</span> <span class="nf">fetch</span><span class="p">(</span><span class="dl">'</span><span class="s1">https://flatgeobuf.org/test/data/UScounties.fgb</span><span class="dl">'</span><span class="p">)</span>

<span class="k">for</span> <span class="k">await </span><span class="p">(</span><span class="kd">let</span> <span class="nx">feature</span> <span class="k">of</span> <span class="nx">flatgeobuf</span><span class="p">.</span><span class="nf">deserialize</span><span class="p">(</span><span class="nx">response</span><span class="p">.</span><span class="nx">body</span><span class="p">))</span> <span class="p">{</span>
  <span class="c1">// Do stuff with `feature` here</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The above snippet relies on your browser’s support for streaming which is surprisingly a sorta new thing.
Streams have been around in non-web programming languages for ages and even Node.js made them a key language feature but support in browsers more or less <a href="https://web.dev/fetch-upload-streaming/">just</a> <a href="https://css-tricks.com/web-streams-everywhere-and-fetch-for-node-js/">landed</a> and I somehow missed it.</p>

<p>All of this got me thinking about a problem we ran into at $DAYJOB where we wanted to be able to show hundreds of thousands of points on a 2D map or 3D globe, all on the client-side.
Since our data doesn’t change very often, we could just build custom 2D/3D tiles and send those but I wondered how fast doing this with a Flatgeobuf would be.</p>

<p>To test this out, I wrote a small Python script to generate a GeoJSON FeatureCollection of a million random points:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">collections</span> <span class="kn">import</span> <span class="n">OrderedDict</span>
<span class="kn">import</span> <span class="n">json</span>
<span class="kn">from</span> <span class="n">numpy.random</span> <span class="kn">import</span> <span class="n">default_rng</span>

<span class="n">rng</span> <span class="o">=</span> <span class="nf">default_rng</span><span class="p">()</span>

<span class="n">n</span> <span class="o">=</span> <span class="mi">1000000</span>
<span class="n">lons</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="nf">uniform</span><span class="p">(</span><span class="o">-</span><span class="mi">180</span><span class="p">,</span> <span class="mi">180</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
<span class="n">lats</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="nf">uniform</span><span class="p">(</span><span class="o">-</span><span class="mi">90</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
<span class="n">pairs</span> <span class="o">=</span> <span class="nf">zip</span><span class="p">(</span><span class="n">lons</span><span class="p">,</span> <span class="n">lats</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">create_feature</span><span class="p">(</span><span class="n">lon</span><span class="p">,</span> <span class="n">lat</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">type</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Feature</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">properties</span><span class="sh">"</span><span class="p">:</span> <span class="nc">OrderedDict</span><span class="p">(),</span>
        <span class="sh">"</span><span class="s">geometry</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span><span class="sh">"</span><span class="s">type</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Point</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">coordinates</span><span class="sh">"</span><span class="p">:</span> <span class="p">(</span><span class="n">lon</span><span class="p">,</span> <span class="n">lat</span><span class="p">)},</span>
    <span class="p">}</span>

<span class="n">features</span> <span class="o">=</span> <span class="p">[</span><span class="nf">create_feature</span><span class="p">(</span><span class="n">pair</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">pair</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="k">for</span> <span class="n">pair</span> <span class="ow">in</span> <span class="n">pairs</span><span class="p">]</span>

<span class="n">geodata</span> <span class="o">=</span> <span class="p">{</span>
    <span class="sh">"</span><span class="s">type</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">FeatureCollection</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">features</span><span class="sh">"</span><span class="p">:</span> <span class="n">features</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="sh">"</span><span class="s">./output.geojson</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">w</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="nf">dumps</span><span class="p">(</span><span class="n">geodata</span><span class="p">))</span>
</code></pre></div></div>

<p>And then ran that through <code class="language-plaintext highlighter-rouge">ogr2ogr</code> to generate a Flatgeobuf file:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ogr2ogr <span class="nt">-f</span> FlatGeobuf output.fgb output.geojson
</code></pre></div></div>

<p>While Flatgeobuf may be new to me, it seems like it already has pretty broad support in tools such as <a href="https://gdal.org/">GDAL</a>, <a href="https://fiona.readthedocs.io/">Fiona</a>, <a href="https://qgis.org/">QGIS</a>, <a href="https://flatgeobuf.org/#supported-applications--libraries">and more</a>.</p>

<p>The above <code class="language-plaintext highlighter-rouge">ogr2ogr</code> command created a 106.7MB Flatgeobuf file.
I slapped together a quick demo using JS similar to the above but swapped out <a href="https://d3js.org/">d3</a> for the <a href="https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API">Canvas API</a> and promptly ran into my first hiccup with the format: My points didn’t start drawing until about 40MB of the file were streamed, which was not what I was going for.</p>

<p>I asked about this on the developers’ Discord and they quickly set me straight: Flatgeobuf’s index is stored at the beginning of the file and uses about 40 bytes per node.
So for my 1,000,000 points, roughly 40MB is just for the index.</p>
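
<p>As a sanity check on that estimate: the index is a packed R-tree, so on top of one node per feature there are interior nodes all the way up to the root. A quick back-of-the-envelope calculation (assuming roughly 40 bytes per node and a branching factor of 16; check the spec for the exact values) lands right around what I observed:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rough estimate of the Flatgeobuf spatial index size. The 40 bytes per
# node and branching factor of 16 are assumptions based on the numbers
# discussed above, not authoritative values from the spec.
def index_size_bytes(num_features, bytes_per_node=40, node_size=16):
    total_nodes = 0
    level = num_features
    while level &gt; 1:
        total_nodes += level
        level = -(-level // node_size)  # ceil division: next level up
    return (total_nodes + 1) * bytes_per_node  # +1 for the root node

print(index_size_bytes(1_000_000) / 1e6)  # roughly 42.7 MB
</code></pre></div></div>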

<p><img src="/assets/flatgeobuf/flatgeobuf-format.png" alt="diagram of the Flatgeobuf format showing a sub-divided rectangle composed of four sub-rectangles labeled MB (Magic Bytes), H (Header), I (optional index), and DATA" /></p>

<p>Since my demo didn’t need the spatial index (I didn’t need to subset by a bounding box, which is another feature of the format), I could omit the index by passing the <code class="language-plaintext highlighter-rouge">-lco SPATIAL_INDEX=NO</code> flag to <code class="language-plaintext highlighter-rouge">ogr2ogr</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ogr2ogr <span class="nt">-f</span> FlatGeobuf <span class="nt">-lco</span> <span class="nv">SPATIAL_INDEX</span><span class="o">=</span>NO output.fgb output.geojson
</code></pre></div></div>

<p>The resulting file ended up coming in at 64MB which matches the above estimate.
The impact of not having the index is that drawing happens in the same order as the features were written out in the GeoJSON file, rather than following a <a href="https://en.wikipedia.org/wiki/Hilbert_R-tree">Hilbert R-Tree</a>.
For my use case, this is totally fine.</p>
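
<p>My generator produced random points anyway, so write order didn’t matter, but for real data you can control the draw order yourself by reordering the features list before writing the GeoJSON. A hypothetical tweak to the generator script above, shuffling to get a random fill pattern:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical tweak: without the spatial index, features draw in write
# order, so shuffling before writing gives a random fill pattern instead
# of whatever order the source data happened to be in.
import random

# Stand-in for the real `features` list built by create_feature() above
features = [{"id": i} for i in range(5)]
random.shuffle(features)  # reorder in place before building the FeatureCollection
</code></pre></div></div>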

<p>With this smaller, index-less file, the drawing begins immediately.
And on my 10Mbit home connection, it actually takes about 20 seconds to stream which is (accidentally) a great way to showcase how well this works:</p>

<p><img src="/assets/flatgeobuf/points-animation.gif" alt="animation showing one million random black points being drawn on a white backdrop, eventually turning into a mostly black rectangle as the points fill in" /></p>

<p>Check out the demo for yourself at <a href="https://amoeba-flatgeobuf-experiments.netlify.app">https://amoeba-flatgeobuf-experiments.netlify.app</a> or check out <a href="https://github.com/amoeba/flatgeobuf-experiments">the code</a>.
The demo only really works on Chrome, probably due to differences in Canvas API implementations.
Safari seems to delay doing any painting until all draw commands are done and Firefox seems to batch them.</p>]]></content><author><name></name></author><category term="software" /><category term="spatial" /><category term="flatgeobuf" /><category term="python" /><category term="geojson" /><category term="gdal" /><summary type="html"><![CDATA[I recently came across Flatgeobuf and it looks like a really neat project. An Observable Notebook by its creator shows an example of progressively rendering polygons for all counties in the US and it got me thinking about how to apply it elsewhere.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brycemecum.com/assets/images/og/posts/flatgeobuf.png" /><media:content medium="image" url="https://brycemecum.com/assets/images/og/posts/flatgeobuf.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Testing R API Packages</title><link href="https://brycemecum.com/2020/08/05/testing-r-api-packages/" rel="alternate" type="text/html" title="Testing R API Packages" /><published>2020-08-05T00:00:00+00:00</published><updated>2020-08-05T00:00:00+00:00</updated><id>https://brycemecum.com/2020/08/05/testing-r-api-packages</id><content type="html" xml:base="https://brycemecum.com/2020/08/05/testing-r-api-packages/"><![CDATA[<p>I recently needed to test an <a href="https://github.com/nceas/rt">R package</a> at <a href="https://nceas.ucsb.edu">work</a> destined for <a href="https://cran.r-project.org/">CRAN</a> that wraps an <a href="https://bestpractical.com/request-tracker">API</a> and ran into a situation where:</p>

<ol>
  <li>I wanted only <em>unit</em> tests to run when CRAN checks the package. A package whose tests run on CRAN and depend on web services such as APIs is bound to fail CRAN’s checks eventually, which is a pain for both CRAN and you.</li>
  <li>I wanted to check the package across a variety of platforms and R versions in a typical build matrix fashion.</li>
  <li>I wanted to run a full <em>integration</em> test suite somewhere other than my machine in order to ensure the integration tests work in a clean environment.</li>
</ol>

<p>I settled on <a href="https://github.com/features/actions">GitHub Actions</a> because it’s integrated with GitHub itself (which is really nice) and there are already great resources such as <a href="https://www.jimhester.com/talk/2020-rsc-github-actions/">Jim Hester’s talk</a> and helpful utilities such as <a href="https://usethis.r-lib.org/reference/github_actions.html">usethis::use_github_actions()</a> which make it easy to get started.</p>

<p>The setup requires creating two GitHub Actions <a href="https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions">workflows</a>:</p>

<ol>
  <li>One that runs <code class="language-plaintext highlighter-rouge">R CMD CHECK</code> across a build matrix of platforms and R versions to ensure the package works for others. This runs just <em>unit</em> tests (i.e., those that don’t depend on external access to an API).</li>
  <li>Another that runs the full <em>integration</em> test suite. This will use a <a href="https://docker.com">Docker</a> container to spin up a fresh instance of the API I’m testing which is super easy with GitHub Actions.</li>
</ol>

<p>Before setting up both workflows, I needed a way to skip a test if it’s an integration test (i.e., depended on having access to the API).
I use <code class="language-plaintext highlighter-rouge">testthat</code> for my tests so I defined a helper in <code class="language-plaintext highlighter-rouge">./tests/setup-rt.R</code> (<code class="language-plaintext highlighter-rouge">rt</code> is my package name here) which makes my helper available to all tests:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Skip helper to control whether integration tests are run or not</span><span class="w">
</span><span class="n">skip_unless_integration</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">Sys.getenv</span><span class="p">(</span><span class="s2">"RT_INTEGRATION"</span><span class="p">)</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">skip</span><span class="p">(</span><span class="s2">"Skipping integration test. Set RT_INTEGRATION to TRUE to run all tests."</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>This is the basis for a convention in my package where the full test suite is only run when the environmental variable <code class="language-plaintext highlighter-rouge">RT_INTEGRATION</code> is set to <code class="language-plaintext highlighter-rouge">TRUE</code> which I can control with GitHub Actions. With this setup, any test which requires access to the API gets skipped both on CRAN and when running the test suite locally when I prepend the following two lines to a test:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test_that</span><span class="p">(</span><span class="s2">"we can get properties of a ticket"</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">testthat</span><span class="o">::</span><span class="n">skip_on_cran</span><span class="p">()</span><span class="w">
  </span><span class="n">skip_unless_integration</span><span class="p">()</span><span class="w">

  </span><span class="c1"># The rest of the test</span><span class="w">
</span><span class="p">})</span><span class="w">
</span></code></pre></div></div>

<p>With this test helper and <code class="language-plaintext highlighter-rouge">testthat::skip_on_cran()</code>, I can control which tests are run on CRAN and which tests are run when I have GitHub Actions run the full test suite depending on whether I include both, one, or none of them.</p>

<p>Now we need to pair this with the two workflows I mentioned above.
These go in a <code class="language-plaintext highlighter-rouge">.github</code> folder at the top level of the package:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.github
└── workflows
    ├── ci.yml      # Build matrix
    └── tests.yml   # Integration tests

1 directory, 2 files
</code></pre></div></div>

<p>The first, <code class="language-plaintext highlighter-rouge">ci.yml</code> is a workflow that effectively runs <code class="language-plaintext highlighter-rouge">R CMD CHECK</code> on a variety of platforms and R versions (a build matrix):</p>

<div class="language-yml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">on</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">push</span><span class="pi">,</span> <span class="nv">pull_request</span><span class="pi">]</span>

<span class="na">name</span><span class="pi">:</span> <span class="s">CI</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">CI</span><span class="pi">:</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">${{ matrix.config.os }}</span>

    <span class="na">strategy</span><span class="pi">:</span>
      <span class="na">fail-fast</span><span class="pi">:</span> <span class="kc">false</span>
      <span class="na">matrix</span><span class="pi">:</span>
        <span class="na">config</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="pi">{</span> <span class="nv">os</span><span class="pi">:</span> <span class="nv">windows-latest</span><span class="pi">,</span> <span class="nv">r</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.6"</span><span class="pi">,</span> <span class="nv">args</span><span class="pi">:</span> <span class="s2">"</span><span class="s">--no-manual"</span> <span class="pi">}</span>
          <span class="pi">-</span> <span class="pi">{</span> <span class="nv">os</span><span class="pi">:</span> <span class="nv">windows-latest</span><span class="pi">,</span> <span class="nv">r</span><span class="pi">:</span> <span class="s2">"</span><span class="s">4.0"</span><span class="pi">,</span> <span class="nv">args</span><span class="pi">:</span> <span class="s2">"</span><span class="s">--no-manual"</span> <span class="pi">}</span>
          <span class="pi">-</span> <span class="pi">{</span> <span class="nv">os</span><span class="pi">:</span> <span class="nv">macOS-latest</span><span class="pi">,</span> <span class="nv">r</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.6"</span> <span class="pi">}</span>
          <span class="pi">-</span> <span class="pi">{</span> <span class="nv">os</span><span class="pi">:</span> <span class="nv">macOS-latest</span><span class="pi">,</span> <span class="nv">r</span><span class="pi">:</span> <span class="s2">"</span><span class="s">4.0"</span> <span class="pi">}</span>
          <span class="pi">-</span> <span class="pi">{</span> <span class="nv">os</span><span class="pi">:</span> <span class="nv">macOS-latest</span><span class="pi">,</span> <span class="nv">r</span><span class="pi">:</span> <span class="s2">"</span><span class="s">devel"</span><span class="pi">,</span> <span class="nv">args</span><span class="pi">:</span> <span class="s2">"</span><span class="s">--no-manual"</span> <span class="pi">}</span>
          <span class="pi">-</span> <span class="pi">{</span> <span class="nv">os</span><span class="pi">:</span> <span class="nv">ubuntu-18.04</span><span class="pi">,</span> <span class="nv">r</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.5"</span><span class="pi">,</span> <span class="nv">args</span><span class="pi">:</span> <span class="s2">"</span><span class="s">--no-manual"</span> <span class="pi">}</span>
          <span class="pi">-</span> <span class="pi">{</span> <span class="nv">os</span><span class="pi">:</span> <span class="nv">ubuntu-18.04</span><span class="pi">,</span> <span class="nv">r</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.6"</span><span class="pi">,</span> <span class="nv">args</span><span class="pi">:</span> <span class="s2">"</span><span class="s">--no-manual"</span> <span class="pi">}</span>
          <span class="pi">-</span> <span class="pi">{</span> <span class="nv">os</span><span class="pi">:</span> <span class="nv">ubuntu-18.04</span><span class="pi">,</span> <span class="nv">r</span><span class="pi">:</span> <span class="s2">"</span><span class="s">4.0"</span><span class="pi">,</span> <span class="nv">args</span><span class="pi">:</span> <span class="s2">"</span><span class="s">--no-manual"</span> <span class="pi">}</span>
    <span class="na">env</span><span class="pi">:</span>
      <span class="na">R_REMOTES_NO_ERRORS_FROM_WARNINGS</span><span class="pi">:</span> <span class="kc">true</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v1</span>

      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-r@master</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">r-version</span><span class="pi">:</span> <span class="s">${{ matrix.config.r }}</span>

      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-pandoc@master</span>

      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-tinytex@master</span>
        <span class="na">if</span><span class="pi">:</span> <span class="s">contains(matrix.config.args, 'no-manual') == </span><span class="kc">false</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Cache R packages</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/cache@v1</span>
        <span class="na">if</span><span class="pi">:</span> <span class="s">runner.os != 'Windows'</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">${{ env.R_LIBS_USER }}</span>
          <span class="na">key</span><span class="pi">:</span> <span class="s">${{ runner.os }}-r-${{ matrix.config.r }}-${{ hashFiles('**/DESCRIPTION') }}</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install system dependencies</span>
        <span class="na">if</span><span class="pi">:</span> <span class="s">runner.os == 'Linux'</span>
        <span class="na">env</span><span class="pi">:</span>
          <span class="na">RHUB_PLATFORM</span><span class="pi">:</span> <span class="s">linux-x86_64-ubuntu-gcc</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">Rscript -e "install.packages('remotes')" -e "remotes::install_github('r-hub/sysreqs')"</span>
          <span class="s">sysreqs=$(Rscript -e "cat(sysreqs::sysreq_commands('DESCRIPTION'))")</span>
          <span class="s">sudo -s eval "$sysreqs"</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install dependencies</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">install.packages("remotes")</span>
          <span class="s">remotes::install_deps(dependencies = TRUE)</span>
          <span class="s">remotes::install_cran('rcmdcheck')</span>
        <span class="na">shell</span><span class="pi">:</span> <span class="s">Rscript {0}</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Check</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e "rcmdcheck::rcmdcheck(args = '${{ matrix.config.args }}', error_on = 'warning', check_dir = 'check')"</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Upload check results</span>
        <span class="na">if</span><span class="pi">:</span> <span class="s">failure()</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/upload-artifact@master</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">name</span><span class="pi">:</span> <span class="s">${{ runner.os }}-r${{ matrix.config.r }}-results</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">check</span>
</code></pre></div></div>

<p>The second, <code class="language-plaintext highlighter-rouge">tests.yml</code>, runs the full test suite, which includes integration tests:</p>

<div class="language-yml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">on</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">push</span><span class="pi">,</span> <span class="nv">pull_request</span><span class="pi">]</span>

<span class="na">name</span><span class="pi">:</span> <span class="s">Tests</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">CI</span><span class="pi">:</span>
    <span class="na">services</span><span class="pi">:</span>
      <span class="na">rt</span><span class="pi">:</span>
        <span class="na">image</span><span class="pi">:</span> <span class="s">netsandbox/request-tracker</span>
        <span class="na">ports</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s">80:80</span>

    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v1</span>

      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-r@master</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Cache R packages</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/cache@v1</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">${{ env.R_LIBS_USER }}</span>
          <span class="na">key</span><span class="pi">:</span> <span class="s">${{ hashFiles('**/DESCRIPTION') }}</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install system dependencies</span>
        <span class="na">env</span><span class="pi">:</span>
          <span class="na">RHUB_PLATFORM</span><span class="pi">:</span> <span class="s">linux-x86_64-ubuntu-gcc</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">Rscript -e "install.packages('remotes')" -e "remotes::install_github('r-hub/sysreqs')"</span>
          <span class="s">sysreqs=$(Rscript -e "cat(sysreqs::sysreq_commands('DESCRIPTION'))")</span>
          <span class="s">sudo -s eval "$sysreqs"</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install dependencies</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">install.packages("remotes")</span>
          <span class="s">remotes::install_deps(dependencies = TRUE)</span>
          <span class="s">remotes::install_cran('rcmdcheck')</span>
        <span class="na">shell</span><span class="pi">:</span> <span class="s">Rscript {0}</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Check</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e "rcmdcheck::rcmdcheck(args = \"--no-manual\", error_on = 'warning', check_dir = 'check')"</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Upload check results</span>
        <span class="na">if</span><span class="pi">:</span> <span class="s">failure()</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/upload-artifact@master</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">name</span><span class="pi">:</span> <span class="s">results</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">check</span>
</code></pre></div></div>

<p>Hopefully this pattern is useful to others.
So far, I’ve found this setup works well, and GitHub Actions has been a good home for all of it.</p>

<blockquote>
  <p>But the cultural tides are strong. Building a company on Django in 2020 seems like the equivalent of driving a PT Cruiser and blasting Faith Hill’s “Breathe” on a CD while your friends are listening to The Weeknd in their Teslas. Swimming against this current isn’t easy, and not in a trendy contrarian way.</p>
</blockquote>

<p>The four problematic areas Tom mentions (bundle splitting, SSR, APIs, and data fetching) come out of his experience building really awesome things (e.g., at <a href="https://www.mapbox.com">Mapbox</a> and <a href="https://observablehq.com">Observable</a>), so he knows at least a bit about this. To build a “modern” JS application, I find myself having to write considerably more code and configure a multitude of additional libraries to get what I feel I used to get for free with, say, Ruby. And then I’ve still got problems I can’t find good solutions for.</p>

<p>(Related are <a href="http://web.archive.org/web/20200511105458/https://twitter.com/dhh">DHH</a>’s
<a href="https://railsconf.com/2020/video/david-heinemeier-hansson-keynote-interview-with-david-heinemeier-hansson">thoughts</a> about building for the web, which he delivered at RailsConf last week.)</p>

<p>Again from Tom’s article, this gem is kind of hidden near the end:</p>

<blockquote>
  <p>And it’s beneficial for companies to shift computing requirements from their servers to their customers’ browsers: it’s a real win for reducing their spend on infrastructure.</p>
</blockquote>]]></content><author><name></name></author><category term="software" /><category term="react" /><category term="javascript" /><category term="web" /><summary type="html"><![CDATA[From Second-guessing the modern web by Tom MacWright:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brycemecum.com/assets/images/og/posts/modern-web.png" /><media:content medium="image" url="https://brycemecum.com/assets/images/og/posts/modern-web.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Complexity</title><link href="https://brycemecum.com/2020/05/01/complexity/" rel="alternate" type="text/html" title="Complexity" /><published>2020-05-01T00:00:00+00:00</published><updated>2020-05-01T00:00:00+00:00</updated><id>https://brycemecum.com/2020/05/01/complexity</id><content type="html" xml:base="https://brycemecum.com/2020/05/01/complexity/"><![CDATA[<p>From <a href="https://ferd.ca/complexity-has-to-live-somewhere.html">Complexity Has to Live Somewhere</a>:</p>

<blockquote>
  <p><em>Complexity has to live somewhere.</em> If you embrace it, give it the place it deserves, design your system and organisation knowing it exists, and focus on adapting, it might just become a strength.</p>
</blockquote>

<p>Great take on managing complexity.
Instantly reminds me of Rich Hickey’s excellent talk, <a href="https://www.infoq.com/presentations/Simple-Made-Easy/">Simple Made Easy</a> where he discusses simplicity and complexity in software and how complexity does not necessarily mean something is complicated (to understand). Rich’s talk has stuck with me and I imagine this post will too. The entire post (and Rich’s talk) are well worth a read/listen.</p>]]></content><author><name></name></author><category term="software" /><summary type="html"><![CDATA[From Complexity Has to Live Somewhere:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brycemecum.com/assets/images/og/posts/complexity.png" /><media:content medium="image" url="https://brycemecum.com/assets/images/og/posts/complexity.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Rasterizing Shapefiles in R</title><link href="https://brycemecum.com/2014/05/15/rasterizing-in-r/" rel="alternate" type="text/html" title="Rasterizing Shapefiles in R" /><published>2014-05-15T00:00:00+00:00</published><updated>2014-05-15T00:00:00+00:00</updated><id>https://brycemecum.com/2014/05/15/rasterizing-in-r</id><content type="html" xml:base="https://brycemecum.com/2014/05/15/rasterizing-in-r/"><![CDATA[<p>I recently needed to convert a
<a href="https://en.wikipedia.org/wiki/Shapefile">shapefile</a>
 to a
<a href="https://en.wikipedia.org/wiki/Raster_graphics">raster</a>
 for use in
another package and wanted to share my steps here.</p>

<p>For this demonstration, I started with the <a href="https://www.naturalearthdata.com/">Natural Earth
Data</a> <a href="https://www.naturalearthdata.com/downloads/110m-physical-vectors/">1:110m Physical Vectors
Land</a> shapefile and will convert it to a raster using the <code class="language-plaintext highlighter-rouge">raster</code> package.</p>

<p>A bit of searching around on the web led me to <a href="https://amywhiteheadresearch.wordpress.com/">Amy
Whitehead’s</a> page on
<a href="https://amywhiteheadresearch.wordpress.com/2014/05/01/shp2raster/">Converting shapefiles to rasters in
R</a>.
The code listed there wasn’t quite what I needed but gave me a head
start on figuring out what I needed to do.</p>

<h2 id="load-required-packages">Load required packages</h2>

<p>The <code class="language-plaintext highlighter-rouge">maptools</code> package is used to import the shapefile to a
<code class="language-plaintext highlighter-rouge">SpatialPolygonsDataFrame</code> and the <code class="language-plaintext highlighter-rouge">raster</code> package is for rasterizing
the shapefile.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(maptools)

## Loading required package: sp

## Checking rgeos availability: FALSE
##      Note: when rgeos is not available, polygon geometry     computations in maptools depend on gpclib,
##      which has a restricted licence. It is disabled by default;
##      to enable gpclib, type gpclibPermit()

library(raster)
</code></pre></div></div>

<h2 id="load-the-shapefile-well-be-rasterizing">Load the shapefile we’ll be rasterizing</h2>

<p>Use the <code class="language-plaintext highlighter-rouge">readShapePoly</code> function from package <code class="language-plaintext highlighter-rouge">maptools</code> to read our
shapefile in as a <code class="language-plaintext highlighter-rouge">SpatialPolygonsDataFrame</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Retrieved from http//www.naturalearthdata.com/download/110m/physical/ne_110m_land.zip
shape_url &lt;- "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/physical/ne_110m_land.zip"
shape_folder &lt;- tempdir()
shape_path &lt;- tempfile(tmpdir = shape_folder, fileext = ".zip")
download.file(shape_url, shape_path)
unzip(shape_path, exdir = shape_folder)
dir(shape_folder)

##  [1] "devtools4bf4152fc28f"
##  [2] "devtools4bf431284c83"
##  [3] "devtools4bf4334c9928"
##  [4] "devtools4bf4355aa26c"
##  [5] "devtools4bf46278c6fd"
##  [6] "devtools4bf468d5df12"
##  [7] "devtools4bf47558b59"
##  [8] "devtools4bf48790d1c"
##  [9] "downloaded_packages"
## [10] "file4bf429c8b448"
## [11] "file4bf43b4c949f"
## [12] "file4bf4411676aa"
## [13] "file4bf4434bad9c"
## [14] "file4bf45ea45e1b.zip"
## [15] "file4bf4617ac199"
## [16] "file4bf4d7ffb4b"
## [17] "libloc_213_722ad5f9e07c7fe1.rds"
## [18] "ne_110m_land.cpg"
## [19] "ne_110m_land.dbf"
## [20] "ne_110m_land.prj"
## [21] "ne_110m_land.README.html"
## [22] "ne_110m_land.shp"
## [23] "ne_110m_land.shx"
## [24] "ne_110m_land.VERSION.txt"
## [25] "repos_https%3A%2F%2Fcran.rstudio.com%2Fbin%2Fmacosx%2Fel-capitan%2Fcontrib%2F3.5.rds"
## [26] "repos_https%3A%2F%2Fcran.rstudio.com%2Fsrc%2Fcontrib.rds"
## [27] "Rprofile-devtools"
## [28] "rs-graphics-0cee3d3c-f7dd-4e3f-b7e7-548de668efab"

land &lt;- readShapePoly(file.path(shape_folder, "ne_110m_land.shp"))

## Warning: readShapePoly is deprecated; use rgdal::readOGR or sf::st_read
</code></pre></div></div>

<h2 id="create-a-blank-raster">Create a blank raster</h2>

<p>Before we can rasterize the shapefile, we need to create a blank raster.
We specify the number of rows and columns to control the spatial
resolution of our raster. We also have to specify the spatial extent,
which will determine how much of the area of our shapefile we’re
rasterizing. Here, I use the <code class="language-plaintext highlighter-rouge">extent</code> function from the <code class="language-plaintext highlighter-rouge">raster</code> package
to grab the extent of the shapefile and pass that to the <code class="language-plaintext highlighter-rouge">raster</code>
function. You can also specify extent manually.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>blank_raster &lt;- raster(nrow = 100, ncol = 100, extent(land))
</code></pre></div></div>

<p>If we were to plot this now, we would get an error because, by default,
new rasters are created without any values. Rasters are fundamentally
just like a 2-dimensional matrix (though they are stored as 1-d
vectors). To get at this 2-d matrix, rasters have a
<a href="https://stat.ethz.ch/R-manual/R-devel/library/methods/html/slot.html">slot</a>
called <code class="language-plaintext highlighter-rouge">data</code>, which you can access by calling <code class="language-plaintext highlighter-rouge">blank_raster@data</code>. You
can get (and set) the values of the raster using the <code class="language-plaintext highlighter-rouge">values</code> function.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>values(blank_raster) &lt;- 1
plot(blank_raster)
</code></pre></div></div>

<p><img src="/assets/rasterizing-in-r/blank_raster-1.png" alt="" /></p>

<p>Above, we’ve set all those values to 1 and the result is just a raster
of one value (color). The values inside the raster are stored in a
vector of length <code class="language-plaintext highlighter-rouge">nrow * ncol</code>, so let’s set the values equal to the
sequence of numbers from <code class="language-plaintext highlighter-rouge">1:(nrow * ncol)</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>values(blank_raster) &lt;- 1:(100*100)
plot(blank_raster, col=rainbow(50))
</code></pre></div></div>

<p><img src="/assets/rasterizing-in-r/blank_raster_rainbow-1.png" alt="" /></p>

<p>Pretty cool! So we can see that the raster’s values are stored going from
the top-left to the bottom-right.</p>

<h2 id="rasterize-the-shapefile">Rasterize the shapefile</h2>

<p>The first line here is pretty straightforward but the second line
requires some explanation. The <code class="language-plaintext highlighter-rouge">rasterize</code> function tries to assign values
to the resulting raster cells. Cells inside any of the polygons of the
shapefile will have some value, while cells not included in any of the
polygons of the shapefile will have the value <code class="language-plaintext highlighter-rouge">NA</code>. For some shapefiles,
values are picked up from the attributes embedded in the shapefile and
will result in variation in values (colors). In my case, I want to force
all cells corresponding to land to be equal to 1 and leave the rest
<code class="language-plaintext highlighter-rouge">NA</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>land_raster &lt;- rasterize(land, blank_raster)
land_raster[!(is.na(land_raster))] &lt;- 1
</code></pre></div></div>

<h2 id="plot-the-result">Plot the result</h2>

<p>Rasters can be plotted directly with the <code class="language-plaintext highlighter-rouge">plot</code> function because the
<code class="language-plaintext highlighter-rouge">raster</code> package implements its own <code class="language-plaintext highlighter-rouge">plot</code> method. Here, we also add the
shapefile back on top of the raster to see how we did.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plot(land_raster, legend=FALSE)
plot(land, add=TRUE)
</code></pre></div></div>

<p><img src="/assets/rasterizing-in-r/land_raster-1.png" alt="" /></p>

<p>Looks pretty good! The filled-in yellow area is the new raster. Let’s
zoom in to see what’s going on.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plot(land_raster, legend=FALSE, xlim=c(-170, -120), ylim=c(30, 65))
plot(land, add=TRUE,  xlim=c(-160, -120), ylim=c(30, 90))
</code></pre></div></div>

<p><img src="/assets/rasterizing-in-r/land_raster_zoomed-1.png" alt="" /></p>

<p>Not great. The raster cells are roughly outlining the land but end up
being pretty poor approximations in places. Clearly, we might like to
improve the spatial resolution of our raster but this is a pretty good
start.</p>

<p>Let’s turn the spatial resolution up to see if we can do better.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>blank_raster &lt;- raster(nrow = 1000, ncol = 2000, extent(land))
land_raster &lt;- rasterize(land, blank_raster)
land_raster[!(is.na(land_raster))] &lt;- 1
plot(land_raster, legend=FALSE, xlim=c(-170, -120), ylim=c(30, 65))
plot(land, add=TRUE,  xlim=c(-160, -120), ylim=c(30, 90))
</code></pre></div></div>

<p><img src="/assets/rasterizing-in-r/land_raster_zoomed_hires-1.png" alt="" /></p>

<p>Much better!</p>

<h2 id="bonus-effect-of-different-raster-resolutions">Bonus: Effect of different raster resolutions</h2>

<p>In the above example, I used a 100x100 grid for the raster. Below, I
test 20x20, 100x100, 500x500, and 1000x1000 grids together to explore
the effect of that part of the process. What resolution do we need to
adequately describe the shape of the land?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>layout(matrix(1:4, ncol=2, byrow=TRUE))

resolutions &lt;- c(20, 100, 500, 1000)

for(r in resolutions)
{
  blank_raster &lt;- raster(nrow = r, ncol = r, extent(land))
  land_raster &lt;- rasterize(land, blank_raster)
  land_raster[!(is.na(land_raster))] &lt;- 1

  plot(land_raster, legend=FALSE, xlim=c(-170, -120), ylim=c(30, 65), main=paste0("Resolution: ", r, "x", r))
  plot(land, add=TRUE,  xlim=c(-160, -120), ylim=c(30, 90))
}
</code></pre></div></div>

<p><img src="/assets/rasterizing-in-r/land_raster_grid-1.png" alt="" /></p>

<p>You can see that, by 500x500, we’re getting a raster that looks pretty
close to the underlying shapefile.</p>]]></content><author><name></name></author><category term="post" /><category term="r" /><category term="maps" /><category term="raster" /><summary type="html"><![CDATA[I recently needed to convert a shapefile to a raster for use in another package and wanted to share my steps here.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brycemecum.com/assets/images/og/posts/rasterizing-in-r.png" /><media:content medium="image" url="https://brycemecum.com/assets/images/og/posts/rasterizing-in-r.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Working with NetCDF files in R</title><link href="https://brycemecum.com/2014/02/18/netcdf-in-r/" rel="alternate" type="text/html" title="Working with NetCDF files in R" /><published>2014-02-18T00:00:00+00:00</published><updated>2014-02-18T00:00:00+00:00</updated><id>https://brycemecum.com/2014/02/18/netcdf-in-r</id><content type="html" xml:base="https://brycemecum.com/2014/02/18/netcdf-in-r/"><![CDATA[<p><a href="https://en.wikipedia.org/wiki/NetCDF">NetCDF</a>
 is an open file format
commonly used to store oceanographic (and other) data such as sea
surface temperature (SST), sea level pressure (SLP), and much more. I
recently needed to work with SST data from the <a href="http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.surface.html">NCEP
Reanalysis</a>
and found that I didn’t know how to work with NetCDF files. This post
should serve as a short introduction to working with NetCDF files using the
R package <code class="language-plaintext highlighter-rouge">ncdf4</code>.</p>

<h2 id="step-1-acquire-the-netcdf-library">Step 1: Acquire the NetCDF library</h2>

<p>Before we can open a NetCDF file in R, we need to install the NetCDF
library on our system. I’m using a Mac running OS 10.9 and I use
<a href="https://brew.sh/"><code class="language-plaintext highlighter-rouge">homebrew</code></a>
 as my package manager.</p>

<p>If you’re using <code class="language-plaintext highlighter-rouge">homebrew</code>, install the <code class="language-plaintext highlighter-rouge">netcdf</code> library with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew install netcdf
</code></pre></div></div>
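<p>If you’re on a Debian- or Ubuntu-based Linux system, something like the following should work (the package name may vary by distribution and release):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get install libnetcdf-dev
</code></pre></div></div>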

<p>If you’re on Windows, you can find pre-built binaries at the developer’s
<a href="http://www.unidata.ucar.edu/downloads/netcdf/index.jsp">download page</a>.
For other systems, consult the
<a href="http://www.unidata.ucar.edu/software/netcdf/docs/getting.html">documentation</a>.</p>

<h2 id="step-2-install-and-load-the-ncdf-package-in-r">Step 2: Install and load the <code class="language-plaintext highlighter-rouge">ncdf4</code> package in R</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>install.packages("ncdf4")
</code></pre></div></div>

<p>And load it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(ncdf4)
</code></pre></div></div>

<h2 id="step-3-load-your-netcdf-file">Step 3: Load your NetCDF file</h2>

<p>For this tutorial, I’ll be working with the <a href="http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.html">NOAA Optimum Interpolation
(OI) Sea Surface Temperature (SST)
V2</a>
data series of monthly means from December 1981 – current. These data
are produced on a 1° grid for the entire globe.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sst_url &lt;- "ftp://ftp.cdc.noaa.gov/Datasets/noaa.oisst.v2/sst.mnmean.nc"
sst_path &lt;- tempfile()
download.file(sst_url, sst_path) # 53.0 MB
cdf &lt;- nc_open(sst_path)
</code></pre></div></div>

<p>I am interested in the following variables: latitude, longitude, time,
and SST:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lat &lt;- ncdf4::ncvar_get(cdf, varid="lat")
lon &lt;- ncdf4::ncvar_get(cdf, varid="lon")
time &lt;- ncdf4::ncvar_get(cdf, varid="time")
sst &lt;- ncdf4::ncvar_get(cdf, varid="sst")
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">lat</code> and <code class="language-plaintext highlighter-rouge">lon</code> variables are just vectors containing the range of
latitudes (89.5°S to 89.5°N) and longitudes (0.5°E to 359.5°E, starting
from the prime meridian).</p>
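<p>As a quick sanity check, the lengths of these vectors should match the 1° global grid described above, with 180 latitudes and 360 longitudes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>length(lat)

## [1] 180

length(lon)

## [1] 360
</code></pre></div></div>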

<p>The <code class="language-plaintext highlighter-rouge">time</code> variable is a little more complex. Let’s take a look:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>head(time)

## [1] 66443 66474 66505 66533 66564 66594
</code></pre></div></div>

<p>Those don’t look like times at all. If you’re used to working with dates
and times on computers, you may already be way ahead of me. If, however,
you are not, I’ll explain. Dates and times are usually stored by
computers as a number of time units since we started counting time. The
time units may be milliseconds, seconds, or even months – whatever suits
the purpose. Computers commonly store dates and times as the number of
seconds since January 1, 1970 (See <a href="https://en.wikipedia.org/wiki/Unix_time">UNIX
time</a>
). In the case of our data,
time is being counted as <strong>days</strong> since January 1, 1800. This is a
little weird but it makes sense for our data. If you’re using a
different NetCDF file than me, you’ll want to consult the documentation
that goes along with the data to figure out how they’re counting time.</p>

<p>Now that we know how <code class="language-plaintext highlighter-rouge">time</code> is being stored, we can make it
human-readable:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time_d &lt;- as.Date(time, format="%j", origin=as.Date("1800-01-01"))
time_years &lt;- format(time_d, "%Y")
time_months &lt;- format(time_d, "%m")
time_year_months &lt;- format(time_d, "%Y-%m")
</code></pre></div></div>

<p>Here, I set the origin to January 1, 1800 and then convert the original
variable <code class="language-plaintext highlighter-rouge">time</code> into a vector of R <code class="language-plaintext highlighter-rouge">Date</code> objects, passing the
parameters <code class="language-plaintext highlighter-rouge">format="%j"</code> and <code class="language-plaintext highlighter-rouge">origin=as.Date("1800-01-01")</code>. The <code class="language-plaintext highlighter-rouge">time</code>
variable is now much easier to understand:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>head(time_d)

## [1] "1981-12-01" "1982-01-01" "1982-02-01" "1982-03-01" "1982-04-01"
## [6] "1982-05-01"
</code></pre></div></div>

<p>The other variables I created, <code class="language-plaintext highlighter-rouge">time_years</code>, <code class="language-plaintext highlighter-rouge">time_months</code>,
<code class="language-plaintext highlighter-rouge">time_year_months</code> are for added utility. I can now reference SST data
for just a set of years</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time_years %in% c("1990", "1991")
</code></pre></div></div>

<p>or all SST values from June</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time_months %in% c("06")
</code></pre></div></div>

<p>Our last, but most important, variable is SST.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dim(sst)

## [1] 360 180 441
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">sst</code> is three-dimensional and is indexed by longitude, latitude, and
time (respectively). To extract a single SST value from it, we’ll need
to specify an index or range of indices for all three dimensions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sst[lon==220.5, lat==50.5, time_d==as.Date("1990-06-01")]

## [1] 10.91
</code></pre></div></div>

<p>We can use our utility variables to, for example, extract all the
observations for June from this same grid cell:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sst[lon==220.5, lat==50.5, time_months=="06"]

##  [1]  9.27 10.25  9.68  8.81  9.75  8.96  8.82  9.79 10.91 10.04  9.55
## [12] 11.01 10.55  9.46  9.39 10.21 10.00  7.94  9.39  8.78 10.08 10.05
## [23] 10.56 10.67  9.78  8.65  8.33 10.04  8.88  9.11  8.21 10.19 11.51
## [34] 12.05 10.27  9.76 10.32
</code></pre></div></div>

<p>Depending on your goals, this may be as far as you need to get. But
maybe you want to display these data visually. Let’s plot the SSTs for a
range of grid cells onto a map.</p>

<h2 id="step-4-convert-the-sst-data-to-a-dataframe">Step 4: Convert the SST data to a <code class="language-plaintext highlighter-rouge">data.frame</code></h2>

<p>Our NetCDF file has a lot of observations:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prod(dim(sst))

## [1] 28576800
</code></pre></div></div>

<p>To reduce the amount of computation, let’s subset the data to a range of
latitudes and longitudes and also focus in on a particular month in the
data set so we can plot this in two dimensions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lat_range &lt;- seq(55.5, 60.5)
lon_range &lt;- seq(190.5, 195.5)

lat_indices &lt;- lat %in% lat_range
lon_indices &lt;- lon %in% lon_range
time_indices &lt;- time_year_months=="1990-06" # June of 1990
</code></pre></div></div>

<p>Notice how I constructed the <code class="language-plaintext highlighter-rouge">_indices</code> variables. They are vectors of
the same length as their corresponding variable but they contain TRUEs
and FALSEs where the corresponding variable is equal to the desired
ranges. This allows instant subsetting:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sst[lon_indices, lat_indices, time_indices]

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 4.20 4.12 4.63 5.47 6.19 6.63
## [2,] 4.39 4.05 4.40 5.18 5.99 6.64
## [3,] 4.71 4.04 4.11 4.73 5.58 6.55
## [4,] 5.21 4.39 4.32 4.82 5.63 6.61
## [5,] 5.57 4.91 4.80 5.17 5.87 6.70
## [6,] 5.53 5.36 5.29 5.55 6.12 6.81
</code></pre></div></div>

<p>Which gives us the SST values for the 6x6 grid for June of 1990. To save
them in a more convenient format, we’ll convert it to a <code class="language-plaintext highlighter-rouge">data.frame</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cdf_df &lt;- expand.grid(lat_range, lon_range)
names(cdf_df) &lt;- c("lat", "lon")
cdf_df$sst &lt;- NA
head(cdf_df)

##    lat   lon sst
## 1 55.5 190.5  NA
## 2 56.5 190.5  NA
## 3 57.5 190.5  NA
## 4 58.5 190.5  NA
## 5 59.5 190.5  NA
## 6 60.5 190.5  NA
</code></pre></div></div>

<p>Now each row of <code class="language-plaintext highlighter-rouge">cdf_df</code> will correspond to one SST value, one row for
every grid cell combination (36 in total). Let’s populate it with SST
values:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for(i in 1:nrow(cdf_df))
{
  lat_ind &lt;- which(lat == cdf_df[i,"lat"])
  lon_ind &lt;- which(lon == cdf_df[i,"lon"])

  cdf_df[i,"sst"] &lt;- sst[lon_ind, lat_ind, time_indices]
}

head(cdf_df)

##    lat   lon  sst
## 1 55.5 190.5 6.63
## 2 56.5 190.5 6.19
## 3 57.5 190.5 5.47
## 4 58.5 190.5 4.63
## 5 59.5 190.5 4.12
## 6 60.5 190.5 4.20
</code></pre></div></div>
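<p>As an aside, the same result can be computed without a loop by combining <code class="language-plaintext highlighter-rouge">match()</code> with R’s matrix indexing of arrays. This is just a sketch of an equivalent approach, assuming the same variables as above:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># One row of (lon, lat, time) indices per grid cell; the single
# time index returned by which() is recycled down the matrix
idx &lt;- cbind(match(cdf_df$lon, lon),
             match(cdf_df$lat, lat),
             which(time_indices))
cdf_df$sst &lt;- sst[idx]
</code></pre></div></div>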

<p>Looks good! Let’s make a map to display these data on. Luckily, this is
pretty straightforward in R using the <code class="language-plaintext highlighter-rouge">maps</code> and <code class="language-plaintext highlighter-rouge">mapdata</code> packages.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(maps)
library(mapdata)
map('world2Hires', xlim=range(lon_range) + c(-10, 10), ylim=range(lat_range) + c(-5, 5))
box()
points(cdf_df$lon, cdf_df$lat)
</code></pre></div></div>

<p><img src="/assets/netcdf-in-r/firstmap-1.png" alt="" /></p>

<p>The above map is a good start. I’ve used the <code class="language-plaintext highlighter-rouge">world2Hires</code> map from
<code class="language-plaintext highlighter-rouge">mapdata</code>, which lets me create a map centered on the Eastern Bering Sea.
I’ve specified the x- and y-limits to show just the area where we have
SST values. Let’s change the points to rectangles and make the fill
color of each rectangle reflect the SST value for its grid cell.</p>

<p>We’ll first make the color scale:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ncolors &lt;- 5
cols &lt;- cut(cdf_df$sst, ncolors)
palette &lt;- colorRampPalette(c("blue", "red"))(ncolors)
</code></pre></div></div>
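<p>If <code class="language-plaintext highlighter-rouge">cut()</code> is unfamiliar: it bins a numeric vector into a factor with one level per interval, and when a factor is used as an index (as in <code class="language-plaintext highlighter-rouge">palette[cols]</code> below) R uses its underlying integer codes. A toy example with made-up numbers:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bins &lt;- cut(c(1, 2, 9, 10), 2)
as.integer(bins)  # the integer codes that index into a palette

## [1] 1 1 2 2
</code></pre></div></div>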

<p>To plot rectangles instead of points, we can use the <code class="language-plaintext highlighter-rouge">rect()</code> function
instead of <code class="language-plaintext highlighter-rouge">points()</code>. <code class="language-plaintext highlighter-rouge">rect()</code> draws rectangles on the graphics device
and requires the coordinates of each corner (<code class="language-plaintext highlighter-rouge">xleft</code>,
<code class="language-plaintext highlighter-rouge">ybottom</code>, <code class="language-plaintext highlighter-rouge">xright</code>, <code class="language-plaintext highlighter-rouge">ytop</code>).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>map('world2Hires', xlim=range(lon_range) + c(-10, 10), ylim=range(lat_range) + c(-5, 5))
box()

grid_hw &lt;- 0.5 # Grid half width
rect(cdf_df$lon - grid_hw, cdf_df$lat - grid_hw, cdf_df$lon + grid_hw, cdf_df$lat + grid_hw, col=palette[cols])

map.axes()
title(main="June SST Values", xlab="Longitude", ylab="Latitude")
legend("topright", legend=levels(cols), fill=palette)
</code></pre></div></div>

<p><img src="/assets/netcdf-in-r/finalmap-1.png" alt="" /></p>

<p>Looks great! Hopefully this post was a useful introduction to working
with and displaying NetCDF data.</p>]]></content><author><name></name></author><category term="post" /><category term="netcdf" /><category term="r" /><category term="maps" /><summary type="html"><![CDATA[NetCDF is an open file format commonly used to store oceanographic (and other) data such as sea surface temperature (SST), sea level pressure (SLP), and much more. I recently needed to work with SST data from the NCEP Reanalysis and found that I didn’t know how to work with NetCDF files. This post should serve as a short introduction working with NetCDF files using the R package ncdf.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brycemecum.com/assets/images/og/posts/netcdf-in-r.png" /><media:content medium="image" url="https://brycemecum.com/assets/images/og/posts/netcdf-in-r.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>