Strings, OsStrings, and PathBufs in Rust

Published: May 06, 2021

When picking a programming language, there is always a hidden choice to be made. If you want rigor and speed, you can pick a statically typed language. But if you choose this, you will spend much of your time figuring out ways to convert between types. Conversely, if you want flexibility and something easy to read, you can pick a dynamically typed language. But if you choose this, you will spend much of your time figuring out that the root of all the nonsensical errors in your code is types being incorrectly converted. You’re damned if you do and damned if you don’t.

The Rust language falls on the statically typed side of the fence, here. And, as a result, there are many ways to convert between different data types.

One of the types I have always had trouble dealing with is the PathBuf type 1. PathBufs are closely related to the OsString, String, &Path, &OsStr, and &str types, and converting between them always feels unclear to me. Here, I want to break down how these types are related to each other, why they’re different, and how to convert between them.

Strings and slices

I’ll start with the obligatory discussion about the heap. Without going too deep, the contents of String-like types don’t necessarily have a known size at compile time. This means you can’t reserve a fixed amount of memory and be guaranteed that a particular String will never need more than that. To get around this problem, Rust, like other low-level languages, implements a String as a vector that allocates a certain capacity of memory on the heap. You can see this in the definition of the String type 2:

pub struct String {
    vec: Vec<u8>
}
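To make the "wrapper around a vector of bytes" idea concrete, here is a minimal sketch. The helper name byte_and_char_len is my own and not part of the standard library; everything else is a standard String method.

```rust
/// Hypothetical helper: returns (length in bytes, number of chars).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // Reserve heap capacity up front, then fill it.
    let mut s = String::with_capacity(16);
    s.push_str("h\u{e9}llo"); // "héllo"; U+00E9 takes two bytes in UTF-8

    // Five chars, but six bytes in the underlying Vec<u8>.
    println!("{:?}", byte_and_char_len(&s));
    println!("capacity is at least 16: {}", s.capacity() >= 16);

    // A String can be turned into its Vec<u8> without copying.
    let bytes: Vec<u8> = s.into_bytes();
    println!("{:?}", bytes);

    // Going back requires a UTF-8 validity check, so it can fail.
    let round_trip = String::from_utf8(bytes);
    let invalid = String::from_utf8(vec![0x66, 0x80]); // 0x80 is a lone continuation byte
    println!("{} {}", round_trip.is_ok(), invalid.is_err());
}
```

The UTF-8 check in String::from_utf8 is the guarantee that the OS-specific string types later in this post have to work around.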

Rust lets you access this data through the vector-like struct directly or through a slice. A reference to a slice, such as &str, has a known size at compile time, unlike the data it points to. It is made of two parts: a pointer to the starting memory location and the length of the slice 3. Being a fixed size, the reference itself can be stored on the stack, while the bytes it refers to stay wherever they were allocated, often on the heap inside a String. You borrow a slice from a String with the & reference operator.
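A small sketch of the pointer-plus-length idea follows. The helpers str_ref_is_two_words and prefix are names I made up for illustration, and the two-word size is what current Rust toolchains produce rather than something I would treat as a formal guarantee.

```rust
use std::mem::size_of;

/// Hypothetical helper: true if a `&str` is exactly two machine words
/// (one for the pointer, one for the length).
fn str_ref_is_two_words() -> bool {
    size_of::<&str>() == 2 * size_of::<usize>()
}

/// Hypothetical helper: borrows the first `n` bytes of `s`.
/// Returns None if `n` is out of range or splits a multi-byte character.
fn prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

fn main() {
    let owned = String::from("hello world");
    let slice: &str = &owned[..5]; // borrows bytes owned by `owned`
    let literal: &str = "hello"; // borrows bytes stored in the program binary

    println!("{} == {}: {}", slice, literal, slice == literal);
    println!("&str is two words: {}", str_ref_is_two_words());

    // Slicing works on byte offsets, so cutting inside a character fails.
    println!("{:?}", prefix("h\u{e9}llo", 2));
}
```

Because the offsets are in bytes, str::get is a safer choice than direct indexing when the offset comes from user input, since indexing panics on a bad boundary.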

Figure: Strings and str slices

PathBuf and OsString

There are special types of strings that you often need to work with that have some structure for parsing them correctly. Paths on your computer are a good example of this. Navigating these paths is necessary to work with other data on your computer.

The PathBuf and Path types are what you would use to handle file paths, regardless of the operating system. From the Rust documentation, a PathBuf is

An owned, mutable path (akin to String)

and a Path is

A slice of a path (akin to str).

PathBufs are to Strings as &Paths are to &strs. But PathBufs and Paths have extra features for working with the structure of a path.
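As a rough illustration of those extra features, the sketch below builds a path piece by piece and then inspects it. The helper build_path and the file names are placeholders of my own; the methods are from std::path.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: builds `<dir>/<name>.<ext>` without manual string joining.
fn build_path(dir: &str, name: &str, ext: &str) -> PathBuf {
    let mut p = PathBuf::from(dir);
    p.push(name); // appends a component using the platform separator
    p.set_extension(ext);
    p
}

fn main() {
    let owned: PathBuf = build_path("data", "report", "csv");

    // Borrow the owned PathBuf as a &Path, much like String -> &str.
    let borrowed: &Path = owned.as_path();

    println!("{:?}", borrowed.file_name());
    println!("{:?}", borrowed.extension());
    println!("{:?}", borrowed.parent());
}
```

Note that file_name, extension, and parent all return Option values, since not every path has each of those parts.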

Figure: PathBuf and Path types

But when we look at the PathBuf definition to see how it’s declared, we find something unexpected:

pub struct PathBuf {
    inner: OsString,
}

Instead of PathBuf being an extension to String, there’s this intermediate OsString type we have to deal with. But what is an OsString and why do we need it? OsString is described as:

A type that can represent owned, mutable platform-native strings, but is cheaply inter-convertible with Rust strings. The need for this type arises from the fact that:

  • On Unix systems, strings are often arbitrary sequences of non-zero bytes, in many cases interpreted as UTF-8.
  • On Windows, strings are often arbitrary sequences of non-zero 16-bit values, interpreted as UTF-16 when it is valid to do so.
  • In Rust, strings are always valid UTF-8, which may contain zeros. OsString and OsStr bridge this gap by simultaneously representing Rust and platform-native string values, and in particular allowing a Rust string to be converted into an “OS” string with no cost if possible.

If it’s hard to piece together what’s going on here, let’s summarize the situation.

  • Rust Strings are valid UTF-8 strings.
  • File paths are like strings with extra structure, so we use the PathBuf type to handle them nicely instead of doing a bunch of string manipulation.
  • Different operating systems represent strings differently under the hood (i.e. they may not be valid UTF-8). Since paths are a special kind of string in each OS, paths are represented differently under the hood in each OS.
  • If we want PathBuf to be a single interface for working with paths, regardless of the OS, we need to have some way of handling OS-specific strings.
  • There are more use cases for handling OS-specific strings than just paths, so let’s put all that OS-specific stuff into its own type, called OsString.

In even simpler terms: PathBuf encapsulates how operating systems represent paths whereas OsString encapsulates how operating systems represent strings 4. Unfortunately, we need both of these types to have our cake and eat it, too.
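To see the gap that OsString bridges, here is a sketch that builds an OS string Rust cannot treat as a String. The non-UTF-8 part is Unix-only because it relies on std::os::unix::ffi::OsStringExt; describe and non_utf8_example are hypothetical names of mine.

```rust
use std::ffi::OsString;

/// Hypothetical helper: reports whether an OS string can be viewed as a Rust &str.
fn describe(s: &OsString) -> String {
    match s.to_str() {
        Some(valid) => format!("valid UTF-8: {}", valid),
        // to_string_lossy swaps invalid sequences for U+FFFD.
        None => format!("not UTF-8, lossy view: {}", s.to_string_lossy()),
    }
}

/// Unix-only: an OS string holding a byte sequence that is not valid UTF-8.
#[cfg(unix)]
fn non_utf8_example() -> OsString {
    use std::os::unix::ffi::OsStringExt;
    // 0x80 is a lone continuation byte, so these bytes are not valid UTF-8.
    OsString::from_vec(vec![b'f', b'o', 0x80, b'o'])
}

fn main() {
    println!("{}", describe(&OsString::from("notes.txt")));

    #[cfg(unix)]
    {
        println!("{}", describe(&non_utf8_example()));
    }
}
```

A file with a name like the second one can exist on a Unix filesystem, which is why a path type built directly on String would be unable to represent it.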

Converting between types

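To convert between all these types, we can look for implementations in the Rust code and documentation. Every owned type can be borrowed as its slice type with the .as_ref() method (or the more specific .as_str(), .as_os_str(), and .as_path()) at no cost. Going the other way, from a borrowed slice to an owned type, requires a memory allocation. Converting an OsString or a PathBuf into a String can also fail, because the OS string may not be valid UTF-8.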
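Here is a sketch of the borrowing and owning directions, using only standard library methods. The helper path_to_string is a hypothetical name, and the path literal is a placeholder.

```rust
use std::ffi::{OsStr, OsString};
use std::path::{Path, PathBuf};

/// Hypothetical helper: copies a path into a String if it is valid UTF-8.
fn path_to_string(p: &Path) -> Option<String> {
    p.to_str().map(|s| s.to_owned())
}

fn main() {
    let s = String::from("dir/file.txt");
    let os = OsString::from("dir/file.txt");
    let pb = PathBuf::from("dir/file.txt");

    // Owned -> borrowed: no allocation, the type annotation picks the target.
    let s_ref: &str = s.as_ref();
    let os_ref: &OsStr = os.as_ref();
    let p_ref: &Path = pb.as_ref();

    // A String can also be borrowed directly as a path.
    let p_from_s: &Path = s.as_ref();
    println!("{}", p_from_s == p_ref);

    // Borrowed -> owned: each of these allocates a new buffer.
    let s_owned: String = s_ref.to_owned();
    let os_owned: OsString = os_ref.to_os_string();
    let pb_owned: PathBuf = p_ref.to_path_buf();
    println!("{} {:?} {:?}", s_owned, os_owned, pb_owned);

    // OS-flavoured -> str: fallible, since the bytes may not be UTF-8.
    println!("{:?}", path_to_string(p_ref));
}
```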

Figure: Conversion map

We get a free bonus converting between OsString and PathBuf since they are essentially the same thing under the hood. This map has helped me convert between these types more easily, without needing to think about it so much.
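The owned-to-owned conversions can be sketched like this. The two function names are hypothetical; the conversions they wrap (From, into_os_string, into_string) are standard library calls that reuse the existing buffer.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical helper: String -> OsString -> PathBuf, reusing the same buffer.
fn string_to_pathbuf(s: String) -> PathBuf {
    let os: OsString = OsString::from(s);
    PathBuf::from(os)
}

/// Hypothetical helper: PathBuf -> OsString -> String.
/// On failure the original OsString is handed back so nothing is lost.
fn pathbuf_to_string(p: PathBuf) -> Result<String, OsString> {
    p.into_os_string().into_string()
}

fn main() {
    let pb = string_to_pathbuf(String::from("dir/file.txt"));
    println!("{:?}", pb);

    match pathbuf_to_string(pb) {
        Ok(s) => println!("back to a String: {}", s),
        Err(os) => println!("not valid UTF-8: {:?}", os),
    }
}
```

PathBuf::from also accepts a String directly, so the intermediate OsString is spelled out here only to mirror the chain of types.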

Conclusions

My general advice is that if you are trying to work with paths, try your best to use PathBuf. But if you want your code to run on multiple operating systems, you will inevitably run into some OsStrings that you will probably need to convert into a valid PathBuf. You can use the map above to easily convert between them.
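One way to follow this advice is to let functions accept anything that can be viewed as a path, so the caller rarely needs to convert by hand. This is a sketch of that pattern; config_file and the file names in it are placeholders I made up.

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Hypothetical helper: returns `<base>/config.toml`.
/// `AsRef<Path>` lets callers pass &str, String, OsString, &Path, or PathBuf.
fn config_file<P: AsRef<Path>>(base: P) -> PathBuf {
    base.as_ref().join("config.toml")
}

fn main() {
    // The same function works for every string flavour in this post.
    println!("{:?}", config_file("/etc/app"));
    println!("{:?}", config_file(String::from("/etc/app")));
    println!("{:?}", config_file(OsString::from("/etc/app")));
    println!("{:?}", config_file(PathBuf::from("/etc/app")));

    // Command-line arguments arrive as OsStrings when read with args_os,
    // and they convert straight into a PathBuf.
    let first_arg: Option<PathBuf> = std::env::args_os().nth(1).map(PathBuf::from);
    println!("{:?}", first_arg.map(config_file));
}
```

The standard library uses the same AsRef<Path> bound for functions like std::fs::read_to_string, which is why those accept a plain string literal.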

This post isn’t exhaustive. There are still edge cases to deal with, like the OS-specific OsStringExt on Unix, WASI and Windows. But at least this map will serve you for most purposes.

Man, computers are complicated.

Footnotes

  1. I’m using the terms type and struct loosely here. Rust has specific definitions for these two terms, but “type” is used more generally in other languages. I’ll mostly be using the term “type” in this post.

  2. Strings in Rust aren’t vectors-of-characters like they are in other languages. To ensure generality, all Strings are guaranteed to be proper UTF-8 strings. This means that instead of being vectors-over-characters, Rust Strings are vectors-over-bytes. A single byte can be stored as a u8 type, which is why the String type is really just a wrapper around Vec<u8>.

  3. Because Strings are really Vec<u8>, the slice is not a slice over characters, but a slice over bytes. This can make working with the &str string slice weird sometimes. As the Rust docs note, the length of a string slice “is in bytes, not chars or graphemes. In other words, it may not be what a human considers the length of the string”.

  4. Thanks, cheats.rs.