Playing Scrabble with Xarray#

Welcome to Cameron’s Corner! In my last blog post, I explored how to use index-alignment to solve some simple Scrabble problems. Today I want to do the same using Xarray!

But, before I get started, I want to invite you to our upcoming µtraining (“micro-training”) that we will be hosting on December 19th and 21st. This unique training format ensures direct interaction with instructors and your peers, providing practical insights and immediate problem-solving guidance.

Ready to reshape your Python journey? Join us for two days of immersive learning, where expertise meets real-world application!

Now, back to the word games!

Xarray is a NumPy-like library for manipulating n-dimensional arrays with labeled dimensions and coordinates. In short, it’s the flexibility of NumPy paired with the power of index-alignment that pandas offers.
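To make "index-alignment" concrete, here is a minimal sketch with two toy arrays (the names `a`, `b`, and `total` and the values are mine, purely for illustration). The two arrays share a "letter" dimension but list their labels in different orders, and the addition still pairs values by label rather than by position.

```python
import xarray as xr

# Two small arrays that share a "letter" dimension, labeled in different orders.
a = xr.DataArray([1, 2, 3], dims="letter", coords={"letter": ["a", "b", "c"]})
b = xr.DataArray([30, 10, 20], dims="letter", coords={"letter": ["c", "a", "b"]})

# Arithmetic pairs values by coordinate label, not by position.
total = a + b

print(total.sel(letter="a").item())  # 11 (1 + 10)
print(total.sel(letter="c").item())  # 33 (3 + 30)
```

With plain NumPy arrays, the same addition would silently pair 1 with 30, which is exactly the kind of mistake the labels are there to prevent.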

Loading Words#

Let’s jump right in by loading all of the possible words in our Scrabble dictionary. I’ll then put these words into an xarray.Coordinates object so that we can easily reuse them later. This last step is not strictly necessary, but it will speed up the creation of our xarray.DataArrays later.

from string import ascii_lowercase
from xarray import Coordinates

with open('./data/words') as f:
    all_words = [
        line for line in (l.strip() for l in f)
        if (
            (len(line) >= 3) 
            and line.isalpha() 
            and line.isascii()
            and line.islower()
        )
    ]
    
scrabble_coords = Coordinates({
    'word': all_words,
    'letter': [*ascii_lowercase],
})

scrabble_coords
Coordinates:
  * word     (word) <U23 'aah' 'aardvark' 'aardvarks' ... 'zygotic' 'zymurgy'
  * letter   (letter) <U1 'a' 'b' 'c' 'd' 'e' 'f' ... 'u' 'v' 'w' 'x' 'y' 'z'

With pandas, we could use the convenient DataFrame constructor methods to easily transform our data from a list of dictionaries into a DataFrame. Unfortunately, no such method exists in Xarray, so I’ll need to “sparsify” the data manually.

In this context, “dense” data is a direct mapping of letters → counts in which the only letters present are those that appear in the word. A “sparse” mapping, by contrast, contains every letter of the alphabet, with a count of 0 (zero) for letters that do not appear in a given word.
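As a quick worked example of the two representations, here is a sketch for a single word. I'm using `collections.Counter` as a stand-in for a hand-written counting loop, and the variable names `dense` and `sparse` are just illustrative labels that follow this post's use of the terms.

```python
from collections import Counter
from string import ascii_lowercase

word = "aardvark"

# "Dense" in this post's sense: only the letters that appear in the word.
dense = dict(Counter(word))

# "Sparse" in this post's sense: one slot per letter of the alphabet, zeros included.
sparse = [dense.get(letter, 0) for letter in ascii_lowercase]

print(dense)        # {'a': 3, 'r': 2, 'd': 1, 'v': 1, 'k': 1}
print(len(sparse))  # 26
print(sparse[:5])   # [3, 0, 0, 1, 0] -> counts for a, b, c, d, e
```

The fixed-length list is what lets every word become one row of the same 26-column array.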

from xarray import DataArray
from string import ascii_lowercase

def letter_counts(text) -> dict[str, int]:
    """Create dictionary of letters → count from some given text
    """
    tile_counts = {}
    for letter in text:
        tile_counts[letter] = tile_counts.get(letter, 0) + 1
    return tile_counts

def counts_to_sparse(counts, index=ascii_lowercase):
    """
    Convert dictionary of letter → count to a sparse representation
    incorporating ALL values in `index`
    """
    return [counts.get(l, 0) for l in index]

words_arr = DataArray(
    data=[counts_to_sparse(letter_counts(w)) for w in all_words],
    coords=scrabble_coords[['word', 'letter']].coords,
)

words_arr
<xarray.DataArray (word: 77790, letter: 26)>
array([[2, 0, 0, ..., 0, 0, 0],
       [3, 0, 0, ..., 0, 0, 0],
       [3, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 1],
       [0, 0, 1, ..., 0, 1, 1],
       [0, 0, 0, ..., 0, 2, 1]])
Coordinates:
  * word     (word) <U23 'aah' 'aardvark' 'aardvarks' ... 'zygotic' 'zymurgy'
  * letter   (letter) <U1 'a' 'b' 'c' 'd' 'e' 'f' ... 'u' 'v' 'w' 'x' 'y' 'z'

Now we have an xarray.DataArray whose coordinates correspond to each given word and each letter in the alphabet. Each value in the array answers the question “how many times does this letter appear in this word?”
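Because both dimensions are labeled, individual counts can be looked up by name instead of by position. The sketch below rebuilds a three-word version of the array (`small_arr` and its word list are stand-ins I made up so the snippet runs on its own) and queries it with `.sel`.

```python
import xarray as xr
from string import ascii_lowercase

# A tiny stand-in for the full words array (these three words are illustrative).
words = ["aah", "hip", "zymurgy"]
counts = [[w.count(letter) for letter in ascii_lowercase] for w in words]

small_arr = xr.DataArray(
    counts,
    dims=("word", "letter"),
    coords={"word": words, "letter": list(ascii_lowercase)},
)

# Label-based lookup: how many times does "a" appear in "aah"?
n_a = small_arr.sel(word="aah", letter="a").item()
print(n_a)  # 2

# All letters that appear at least once in "zymurgy".
row = small_arr.sel(word="zymurgy")
present = row["letter"].values[row.values > 0].tolist()
print(present)  # ['g', 'm', 'r', 'u', 'y', 'z']
```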

Modeling Points#

We can model the points that our letters represent via another xarray.DataArray. Because the two arrays share a dimension (“letter”), we can easily align and combine them.

points = {
    'A': 1, 'E': 1, 'I': 1, 'O': 1, 'U': 1,
    'L': 1, 'N': 1, 'S': 1, 'T': 1, 'R': 1,
    'D': 2, 'G': 2, 'B': 3, 'C': 3, 'M': 3,
    'P': 3, 'F': 4, 'H': 4, 'V': 4, 'W': 4,
    'Y': 4, 'K': 5, 'J': 8, 'X': 8, 'Q': 10, 'Z': 10
}

points_arr = DataArray(
    counts_to_sparse({k.lower(): v for k, v in points.items()}),
    coords=scrabble_coords['letter'].coords
)

points_arr
<xarray.DataArray (letter: 26)>
array([ 1,  3,  3,  2,  1,  4,  2,  4,  1,  8,  5,  1,  3,  1,  1,  3, 10,
        1,  1,  1,  1,  4,  4,  8,  4, 10])
Coordinates:
  * letter   (letter) <U1 'a' 'b' 'c' 'd' 'e' 'f' ... 'u' 'v' 'w' 'x' 'y' 'z'

If I want to see the total number of points that every word is worth, I can calculate a dot product across the “letter” dimension. Xarray aligns my two arrays on that dimension and then hands the computation off to NumPy.

To visualize the output, I convert the resultant array to a pandas.Series.

# Point value of every word
(
    words_arr.dot(points_arr, dims='letter')
).to_series()
word
aah           6
aardvark     16
aardvarks    17
aback        13
abacus       10
             ..
zydeco       21
zygote       19
zygotes      20
zygotic      22
zymurgy      25
Length: 77790, dtype: int64
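If the dot product feels opaque, it may help to see it as "multiply, then sum." The sketch below uses a four-letter toy alphabet (the arrays `small_words` and `small_points` are my own stand-ins, with point values taken from the table above) and scores each word by multiplying counts by points and summing over "letter". I leave the dimension argument off the `.dot` call here, on the assumption that it then sums over the dimension the two arrays share.

```python
import xarray as xr

letters = ["a", "h", "i", "p"]
small_words = xr.DataArray(
    [[2, 1, 0, 0],   # aah
     [0, 1, 1, 1]],  # hip
    dims=("word", "letter"),
    coords={"word": ["aah", "hip"], "letter": letters},
)
small_points = xr.DataArray(
    [1, 4, 1, 3], dims="letter", coords={"letter": letters}
)

# Multiply counts by points (broadcast along "letter"), then sum that dimension away.
scores = (small_words * small_points).sum(dim="letter")
print(scores.sel(word="aah").item())  # 6 (2*1 + 1*4)
print(scores.sel(word="hip").item())  # 8 (4 + 1 + 3)

# The dot product over the shared dimension gives the same numbers.
same = bool((scores == small_words.dot(small_points)).all())
print(same)  # True
```

The 6 for "aah" agrees with the first row of the full output above.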

Instead of finding the value of ALL words, I am interested in finding words I’m able to play given a particular set of tiles. In order to do this, I’ll take a sampling of letters and convert it to a DataArray that also shares the letter dimension.

To check if a word is playable, I simply need to check if the count of letters in my words array is less than or equal to the counts of letters in my tiles array. If this condition is met for all letters, then the given word is playable.

# Valid words to play given some tiles

tiles = [*'pythonik']
tiles_arr = DataArray(
    data=counts_to_sparse(letter_counts(tiles)),
    coords=scrabble_coords['letter'].coords
)

(
    (words_arr <= tiles_arr).all(dim='letter')
).to_series().loc[lambda s: s]
word
hint    True
hip     True
hit     True
hon     True
honk    True
        ... 
toy     True
typo    True
yin     True
yip     True
yon     True
Length: 65, dtype: bool
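As a sanity check on the comparison above, the same rule can be written for a single word in plain Python. This is only a cross-check, not how the array version works internally; `can_play` is a hypothetical helper name of my own.

```python
from collections import Counter

def can_play(word: str, tiles: str) -> bool:
    """True if every letter of `word` is covered by the available `tiles`."""
    # Counter subtraction drops non-positive counts, so anything left over
    # is a letter the word needs more of than the tiles provide.
    return not (Counter(word) - Counter(tiles))

tiles = "pythonik"
print(can_play("python", tiles))  # True
print(can_play("honk", tiles))    # True
print(can_play("hippo", tiles))   # False: needs two p's, but there is one p tile
```

The array version performs this same "needs no more than I have" test for every word and every letter at once.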

Finally, let’s put it all together: find words that are playable and then sort them according to their point values.

# Best word to play given some tiles
is_playable = (words_arr <= tiles_arr).all(dim='letter')
word_points = words_arr.dot(points_arr, dims='letter')

(
    word_points.sel(word=is_playable)
    .sortby(lambda a: a, ascending=False)
).to_series()
word
honky     15
python    14
poky      13
pithy     13
phony     13
          ..
tin        3
not        3
nit        3
ion        3
int        3
Length: 65, dtype: int64
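If I only care about the single best play rather than the full ranking, one option is to mask and take the arg-max instead of sorting. This is a sketch on toy inputs (`word_pts` and `playable` are stand-ins I built by hand using scores from the outputs above), and it assumes that ties can be broken arbitrarily.

```python
import xarray as xr

# Toy inputs (a hand-picked subset, not the full dictionary).
names = ["python", "honky", "zygotic", "tin"]
word_pts = xr.DataArray([14, 15, 22, 3], dims="word", coords={"word": names})
playable = xr.DataArray(
    [True, True, False, True], dims="word", coords={"word": names}
)

# Drop unplayable words, then find the position of the largest remaining score.
# Note: .where() converts the integer scores to floats.
candidates = word_pts.where(playable, drop=True)
best_pos = int(candidates.argmax(dim="word"))

best_word = str(candidates["word"][best_pos].item())
best_score = int(candidates[best_pos].item())
print(best_word, best_score)  # honky 15
```

"zygotic" has the highest raw score here, but it is masked out because it is not playable with these tiles.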

Wrap-Up#

By investing a little effort up front, we can answer many questions within our game of Scrabble. Index-alignment gives us confidence in our results because it guards against the subtle errors that creep in when arrays are misaligned, and it empowers us to answer complex questions with simple operations.

If you work with n-dimensional data and keep mixing up your dimensions/coordinates, then you should definitely check out Xarray. Talk to you all next week!