Playing Scrabble with Xarray#
Welcome to Cameron’s Corner! In my last blog post, I explored how to use index-alignment to solve some simple Scrabble problems. Today I want to do the same using Xarray!
But, before I get started, I want to invite you to our upcoming µtraining (“micro-training”) that we will be hosting on December 19th and 21st. This unique training format ensures direct interaction with instructors and your peers, providing practical insights and immediate problem-solving guidance.
Ready to reshape your Python journey? Join us for two days of immersive learning, where expertise meets real-world application!
Now, back to the word games!
Xarray is a NumPy-like library for manipulating n-dimensional arrays with a mappable coordinate/dimension system. In short, it’s the flexibility of NumPy paired with the power of index-alignment that pandas offers.
Loading Words#
Let's jump right in by loading all of the possible words in our Scrabble dictionary. I’ll then put these words into an xarray.Coordinates object so that we can easily reuse them later. This last step is not strictly necessary, but it will help us speed up the creation of our xarray.DataArray objects later.
from string import ascii_lowercase
from xarray import Coordinates

with open('./data/words') as f:
    all_words = [
        line for line in (l.strip() for l in f)
        if (
            (len(line) >= 3)
            and line.isalpha()
            and line.isascii()
            and line.islower()
        )
    ]

scrabble_coords = Coordinates({
    'word': all_words,
    'letter': [*ascii_lowercase],
})
scrabble_coords
Coordinates:
* word (word) <U23 'aah' 'aardvark' 'aardvarks' ... 'zygotic' 'zymurgy'
* letter (letter) <U1 'a' 'b' 'c' 'd' 'e' 'f' ... 'u' 'v' 'w' 'x' 'y' 'z'
With pandas, we could use the convenient DataFrame constructor methods to easily transform our data from a list of dictionaries into a DataFrame. Unfortunately, no such method exists in Xarray, so I’ll need to “sparsify” the data manually.
In this context, “dense” data would be a direct mapping of letters → counts, where the only letters in the dictionary are letters that appear in the word. A “sparse” mapping, by contrast, would have all letters in the alphabet, with 0 (zero) counts for letters that do not appear in a given word.
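To make the distinction concrete, here is a minimal sketch that uses `collections.Counter` from the standard library as a stand-in (it is not one of the helpers defined in this post). The word `'aah'` is just an arbitrary example.

```python
from collections import Counter
from string import ascii_lowercase

word = 'aah'

# "Dense" in this post's sense: only the letters that occur in the word
dense = dict(Counter(word))

# "Sparse" in this post's sense: one slot per letter of the alphabet,
# with 0 for every letter the word does not contain
sparse = [dense.get(letter, 0) for letter in ascii_lowercase]

print(dense)       # {'a': 2, 'h': 1}
print(sparse[:8])  # [2, 0, 0, 0, 0, 0, 0, 1]
```

Every word ends up with a list of the same length (26), which is what allows the lists to be stacked into a rectangular array.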
from xarray import DataArray
from string import ascii_lowercase

def letter_counts(text) -> dict[str, int]:
    """Create dictionary of letters → count from some given text
    """
    tile_counts = {}
    for letter in text:
        tile_counts[letter] = tile_counts.get(letter, 0) + 1
    return tile_counts

def counts_to_sparse(counts, index=ascii_lowercase):
    """
    Convert dictionary of letter → count to a sparse representation
    incorporating ALL values in `index`
    """
    return [counts.get(l, 0) for l in index]

words_arr = DataArray(
    data=[counts_to_sparse(letter_counts(w)) for w in all_words],
    coords=scrabble_coords[['word', 'letter']].coords,
)
words_arr
<xarray.DataArray (word: 77790, letter: 26)>
array([[2, 0, 0, ..., 0, 0, 0],
       [3, 0, 0, ..., 0, 0, 0],
       [3, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 1],
       [0, 0, 1, ..., 0, 1, 1],
       [0, 0, 0, ..., 0, 2, 1]])
Coordinates:
* word (word) <U23 'aah' 'aardvark' 'aardvarks' ... 'zygotic' 'zymurgy'
* letter (letter) <U1 'a' 'b' 'c' 'd' 'e' 'f' ... 'u' 'v' 'w' 'x' 'y' 'z'
Now we have an xarray.DataArray whose coordinates correspond to each given word and each letter in the alphabet. Each value in the array answers the question “how many times does this letter appear in this word?”
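Because the coordinates are labels, individual counts can be looked up by word and letter instead of by integer position. The following is a small sketch on a made-up three-word array; `toy_words` and `toy_arr` are hypothetical names used only for illustration, and the example assumes `xarray` is installed.

```python
from string import ascii_lowercase
from xarray import DataArray

toy_words = ['aah', 'hip', 'zoo']  # hypothetical mini word list
toy_arr = DataArray(
    data=[[w.count(l) for l in ascii_lowercase] for w in toy_words],
    coords={'word': toy_words, 'letter': [*ascii_lowercase]},
    dims=['word', 'letter'],
)

# Label-based lookup: how many times does 'o' appear in 'zoo'?
n_o = toy_arr.sel(word='zoo', letter='o').item()

# Select one word's whole row, then keep only the letters it contains
row = toy_arr.sel(word='hip')
present = row.where(row > 0, drop=True)

print(n_o)                       # 2
print(present['letter'].values)  # ['h' 'i' 'p']
```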
Modeling Points#
We can model the points that our letters represent via another xarray.DataArray. By sharing a dimension (“letter”), we can easily align and combine these two arrays.
points = {
    'A': 1, 'E': 1, 'I': 1, 'O': 1, 'U': 1,
    'L': 1, 'N': 1, 'S': 1, 'T': 1, 'R': 1,
    'D': 2, 'G': 2, 'B': 3, 'C': 3, 'M': 3,
    'P': 3, 'F': 4, 'H': 4, 'V': 4, 'W': 4,
    'Y': 4, 'K': 5, 'J': 8, 'X': 8, 'Q': 10, 'Z': 10
}
points_arr = DataArray(
    counts_to_sparse({k.lower(): v for k, v in points.items()}),
    coords=scrabble_coords['letter'].coords
)
points_arr
<xarray.DataArray (letter: 26)>
array([ 1, 3, 3, 2, 1, 4, 2, 4, 1, 8, 5, 1, 3, 1, 1, 3, 10, 1, 1, 1, 1, 4, 4, 8, 4, 10])
Coordinates:
* letter (letter) <U1 'a' 'b' 'c' 'd' 'e' 'f' ... 'u' 'v' 'w' 'x' 'y' 'z'
If I want to see the total number of points that every word is worth, I can calculate a dot product across the “letter” dimension. This lets xarray carry out the alignment of my two arrays and then hand the computation off to numpy.
To visualize the output, I convert the resultant array to a pandas.Series.
# Point value of every word
(
    words_arr.dot(points_arr, dims='letter')
).to_series()
word
aah 6
aardvark 16
aardvarks 17
aback 13
abacus 10
..
zydeco 21
zygote 19
zygotes 20
zygotic 22
zymurgy 25
Length: 77790, dtype: int64
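For intuition, the dot product above amounts to multiplying each letter count by that letter's point value and then summing over the shared “letter” dimension. Here is a minimal sketch on a made-up two-word array; `toy_words`, `toy_counts`, and `toy_points` are hypothetical names, and the example assumes `xarray` is installed. Calling `.dot` without naming a dimension sums over every dimension the two arrays share, which here is only “letter”.

```python
from string import ascii_lowercase
from xarray import DataArray

letters = [*ascii_lowercase]
toy_words = ['aah', 'hip']  # hypothetical mini word list

toy_counts = DataArray(
    [[w.count(l) for l in letters] for w in toy_words],
    coords={'word': toy_words, 'letter': letters},
    dims=['word', 'letter'],
)
# Same tile values as the `points` dictionary in this post, in a..z order
toy_points = DataArray(
    [1, 3, 3, 2, 1, 4, 2, 4, 1, 8, 5, 1, 3,
     1, 1, 3, 10, 1, 1, 1, 1, 4, 4, 8, 4, 10],
    coords={'letter': letters},
    dims=['letter'],
)

# Multiply counts by per-letter points (broadcast along 'word'),
# then collapse the shared 'letter' dimension
by_hand = (toy_counts * toy_points).sum(dim='letter')

# The dot product over the shared dimension gives the same totals
via_dot = toy_counts.dot(toy_points)

print(by_hand.values.tolist())  # [6, 8]
print(via_dot.values.tolist())  # [6, 8]
```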
Instead of finding the value of ALL words, I am interested in finding words I’m able to play given a particular set of tiles. In order to do this, I’ll take a sampling of letters and convert it to a DataArray that also shares the letter dimension.
To check if a word is playable, I simply need to check if the count of letters in my words array is less than or equal to the counts of letters in my tiles array. If this condition is met for all letters, then the given word is playable.
# Valid words to play given some tiles
tiles = [*'pythonik']
tiles_arr = DataArray(
    data=counts_to_sparse(letter_counts(tiles)),
    coords=scrabble_coords['letter'].coords
)
(
    (words_arr <= tiles_arr).all(dim='letter')
).to_series().loc[lambda s: s]
word
hint True
hip True
hit True
hon True
honk True
...
toy True
typo True
yin True
yip True
yon True
Length: 65, dtype: bool
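The same per-letter comparison can be sketched in plain Python with `collections.Counter`, which may help when checking a single word by hand. The helper name `is_playable_word` is hypothetical and not part of this post's code.

```python
from collections import Counter

def is_playable_word(word: str, tiles: str) -> bool:
    """Return True if every letter in `word` is covered by `tiles`."""
    available = Counter(tiles)
    needed = Counter(word)
    # Counter returns 0 for missing letters, so no special-casing is needed
    return all(needed[l] <= available[l] for l in needed)

print(is_playable_word('python', 'pythonik'))  # True
print(is_playable_word('pity', 'pythonik'))    # True
print(is_playable_word('tooth', 'pythonik'))   # False: needs two 'o' and two 't'
```

The array version performs this check for every word at once, which is why it only needs a single `<=` comparison followed by `.all(dim='letter')`.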
Finally, let’s put it all together: find words that are playable and then sort them according to their point values.
# Best word to play given some tiles
is_playable = (words_arr <= tiles_arr).all(dim='letter')
word_points = words_arr.dot(points_arr, dims='letter')
(
    word_points.sel(word=is_playable)
    .sortby(lambda a: a, ascending=False)
).to_series()
word
honky 15
python 14
poky 13
pithy 13
phony 13
..
tin 3
not 3
nit 3
ion 3
int 3
Length: 65, dtype: int64
Wrap-Up#
By investing a little effort up front, we can answer many questions within our game of Scrabble. Index-alignment gives us confidence in our results by guarding against the spurious errors that misaligned arrays would otherwise cause, and it empowers us to answer complex questions with simple operations.
If you work with n-dimensional data and keep mixing up your dimensions/coordinates, then you should definitely check out Xarray. Talk to you all next week!