OpenStax Word Doc Conversion Project: Part 1

Posted on Jan 11, 2026
tl;dr: Textbook conversion project: converting docx files to HTML.

Project Overview

As part of the OER challenge grant we recently received at Shepherd, I’m converting OER course materials published by OpenStax to other formats. The two textbooks, Principles of Marketing and Introductory Business Statistics 2e, are in many ways high-quality products. But there are a few problems to deal with:

  • Although OpenStax provides a Common Cartridge package for importing into an LMS (Learning Management System), it’s really just a bunch of references to their web site in a trenchcoat. This makes it impossible to customize the pages and creates a dependency on their web site. It also feels pretty spammy. (I don’t blame them, really. Everyone has to make a living. Still…)
  • Although OpenStax does provide docx files for those who want to adapt the text, whatever process produced them is probably a little scary under the hood. (As you’ll see, it created some problems for me!) They also don’t completely match the very polished PDFs that they distribute. (Another argument for creating everything in markdown and using open source tools, but that’s a sermon for another day.)
  • Although the texts themselves are complete (including practice and homework problems), and some ancillary materials are provided, only one of the two comes with a test bank. It’s understandable that the test bank is not openly licensed (OER needs a better solution for this!), but it’s also only provided as a set of docx files. One suspects that this is due to OpenStax’s business model, rather than technical limitations…

So, my plans to adapt these materials for the two relevant classes here at Shepherd (BADM 340 and BADM 224, respectively) include the following:

  • Convert the docx files to HTML.
  • Package the HTML files for importing into our LMS (Brightspace—though, ultimately, we also want to make the packages available for other users) and/or importing from our LOR (Learning Object Repository).
  • Convert the available test bank, and possibly also end-of-chapter exercises, into an LTI package for importing into the LMS and/or LOR.
  • Possibly also create new test banks for one or both texts.
  • Assist the instructors with adapting the texts and/or slides and converting them to multiple formats.

As always, my goal is to accomplish all of this using open source software!

Having given you my desiderata, let’s turn to how I’m accomplishing the first item above…

Testing the Conversion, Finding Problems

I started off thinking I would script the conversion using a DOS batch file, to keep the solution usable by folks on Windows. My first test of converting the docx files, using my old friend Pandoc, looked like this:

for %%F in (*.docx) do pandoc -f docx -t html5 "%%~nF.docx" -o "%%~nF.html" --ascii --embed-resources --webtex

Inspecting the results, I saw three problems:

  1. All the images were missing. After switching to using --extract-media instead of --embed-resources, I saw that all the extracted files were named with an .so file extension. (Apparently, short for “shared object”?) The actual contents of the files started with the appropriate magic bytes, and they worked fine once renamed and referenced appropriately, but unarchiving the docx file confirmed that they all had names like rId20.so within the word/media folder.
  2. As you can imagine, the statistics text is pretty equation-heavy. Unfortunately, the equations seem to have been created in Word’s equation editor using non-standard characters, which caused the conversion using --webtex to fail in many places. After backing up a step and converting to markdown first, I asked Perplexity to analyze one of the resulting LaTeX equations. It reported that the formula as written is not valid LaTeX math, mainly because of the characters used for the minus sign, the multiplication operator, and the Greek letter rho. (See the illustration just after this list.) Maybe some of this is Pandoc’s fault, but I don’t really believe that. (I should add that some of the “equations” seem to just be regular text, outside of an equation environment—or whatever Word calls it.) I did make a note to give the Mammoth .docx to HTML converter a try at some point. But, since I also want to be able to work with markdown source, that wasn’t the best option for now.
  3. Some of the markdown code for the images in the statistics text didn’t get cleanly converted into HTML. Somehow, the bits like {width="6.5in" height="3.0203423009623798in"} got separated from the rest of the code for the image or from each other.
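
To make the equation problem in item 2 concrete, here is a hypothetical illustration (my own construction, not one of the actual OpenStax equations). Standard LaTeX engines reject Unicode look-alike operators in math mode:

```latex
% Hypothetical illustration, not an equation from the textbook.
% The first line uses the Unicode minus sign (U+2212) and multiplication
% sign (U+00D7), which plain pdflatex rejects; the second line uses the
% ASCII/LaTeX equivalents and compiles fine.
$\rho = \mu − 2 × \sigma$       % fails: U+2212 and U+00D7 in math mode
$\rho = \mu - 2 \times \sigma$  % valid LaTeX math
```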

At this point, it was clear that I was going to want sed, so I decided to abandon the batch file and create a shell script instead. Still, keeping in mind that I wanted my solution to be accessible to as many folks as possible, I decided to test my script under WSL (the Windows Subsystem for Linux).

For this, I needed to install Pandoc on the WSL side:

sudo apt-get install texlive
wget https://github.com/jgm/pandoc/releases/download/3.8.3/pandoc-3.8.3-1-amd64.deb
sudo dpkg -i pandoc-3.8.3-1-amd64.deb

Fixing the Equations

Although I still need to review the issue with the equations carefully, I resolved it fairly quickly for this first version.

For the past four or five years, I’ve been creating HTML versions of my own textbook content using Pandoc’s webtex option:

Convert TeX formulas to <img> tags that link to an external script that converts formulas to images. The formula will be URL-encoded and concatenated with the URL provided. For SVG images you can for example use --webtex https://latex.codecogs.com/svg.latex?. If no URL is specified, the CodeCogs URL generating PNGs will be used (https://latex.codecogs.com/png.latex?). Note: the --webtex option will affect Markdown output as well as HTML, which is useful if you’re targeting a version of Markdown without native math support.

I originally switched to this approach when working with Canvas as the LMS and running into problems with MathML support. But I’m using Brightspace these days, and anyway whatever the problem was with Canvas circa 2021 is probably fixed by now. I’ve been meaning to revisit this anyway, because I suspect that MathML may be better for accessibility; I don’t like depending on the external service from codecogs; and I’m not sure which solution will render best in EPUB3.

Switching to MathML here (replacing --webtex with --mathml on the Pandoc command line) seems to have solved my problem! I assume that browsers are a bit more tolerant of whatever cruft is in those equations, and I am simply taking the win for now!
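
For the record, the conversion command now looks something like this (a sketch using one chapter file as a stand-in):

```bash
# Same conversion as the earlier batch-file test, but with --mathml
# replacing --webtex
pandoc -f docx -t html5 1-introduction.docx -o 1-introduction.html --ascii --mathml
```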

Fixing the Chopped-Up Markdown

I spent just enough time on this issue to get an acceptable solution, and will want to revisit it later, but solving it was easy enough.

While I didn’t verify this rigorously, the issue seemed to stem from some images being placed inside 1x1 tables in MS Word. Why do that? I don’t know. Some weird formatting thing, maybe? (WYSIWYG word processors, not even once!) Anyway, the result in the markdown that Pandoc produced was a 1x1 grid table with multiple lines of markdown inside. Because the formatting instructions for the width and height were split over multiple lines, with a vertical border in between them, Pandoc couldn’t parse the code for the embedded image correctly.

My solution was to replace -t markdown with -t markdown-grid_tables on the Pandoc command line. This disables the grid_tables extension. (Conversely, to explicitly enable a Pandoc extension, replace the - with a +.)
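
In other words, the output format specifier carries the extension switches (a minimal example, again using one chapter file as a stand-in):

```bash
# markdown-grid_tables = markdown output with the grid_tables extension
# disabled; markdown+grid_tables would explicitly enable it instead
pandoc -f docx -t markdown-grid_tables 1-introduction.docx -o 1-introduction.md
```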

The result for these problem blocks of code is now something like this:

  ------------------------------------------------------------
  ![This photo shows a large open news room with enough space
                       to seat about 200
    employees.](1-introduction/media/rId20.jpg){width="6.5in"
                 height="3.0203423009623798in"}

  ------------------------------------------------------------

This isn’t ideal. But, now that the vertical border is out of the way, Pandoc seems to handle these just fine. (There might still be a couple of problem children somewhere in the textbook, but I’ll have to chase those down later.)

Extracting Media from docx

Deep Dive: Why do we end up with these .so files?

A docx file is actually a zip archive containing a bunch of XML files and whatever other files are embedded in the “document”. So, to inspect the contents, you can rename the file with a .zip extension and examine it with your tool of choice (e.g., Windows Explorer). The graphics we want to extract are stored in the word/media folder and, in this case, all end in .so. So how does Word know that they are graphics? Taking 1-introduction.docx from the statistics text as an example:

  1. The file word/_rels/document.xml.rels explicitly declares that the file media/rId20.so is an image:

    <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Id="rId20" Target="media/rId20.so" />
    
  2. But the file [Content_Types].xml, rather than giving an appropriate MIME type (in this case, image/jpeg), goes out of its way to declare it as application/octet-stream:

    <Override PartName="/word/media/rId20.so" ContentType="application/octet-stream"/>
    

    Why? I can only assume that whatever gods-forsaken tooling OpenStax used to create these JPEGs is to blame. (Testing in a freshly-created Word doc with the same image results in the correct MIME type.)

  3. So, Word has to fall back on examining the octet stream directly, where it will find the magic bytes ffd8ffe000104a4649460001 (the string containing JFIF at the start of the file, if you examine it in a text editor), revealing that it’s a JPEG (or, more precisely, JFIF).
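
You can confirm this yourself with xxd (the path below assumes the file has already been pulled out of the archive):

```bash
# Dump the first 12 bytes: the JPEG/JFIF signature, despite the .so name
xxd -l 12 media/rId20.so
# 00000000: ffd8 ffe0 0010 4a46 4946 0001            ......JFIF..
```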

So, if Word is smart enough to do all this, why isn’t Pandoc? I don’t know, but I assume it’s because this twisted setup is unusual enough that no one has gotten around to coding the necessary logic yet. Instead, Pandoc appears to do what is probably good enough for the vast majority of documents: it sees <pic:blipFill> <a:blip r:embed="rId20"/>...</pic:blipFill> in document.xml; looks in the aforementioned document.xml.rels to see that the id is associated with rId20.so; and then extracts rId20.so as-is and links to it by that name (trusting, fatally in this case, that some web server is going to serve it with an appropriate MIME type when the HTML page is viewed).
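
If you want to poke around inside one of these files yourself on the Linux side, you don’t even need to rename it (assuming unzip is installed):

```bash
# A docx is just a zip archive; list the embedded media without renaming
unzip -l 1-introduction.docx 'word/media/*'

# Or extract only the media folder for a closer look
unzip -o 1-introduction.docx 'word/media/*' -d extracted
```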

The Solution

I tackled the marketing text first, since I didn’t have to worry so much about the equations there, and I had a working solution fairly quickly: I simply renamed all the .so files to .png and used sed to rewrite the references in the markdown source. After packaging everything up and importing it into Brightspace, I confidently turned to the statistics text.
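
That quick fix amounted to something like the following (a rough sketch; the full script appears later in this post):

```bash
# Quick fix for the marketing text: every extracted image was a PNG with a
# unique name, so a blanket rename plus a blanket rewrite was enough
for f in media/*.so; do
  mv -- "$f" "${f%.so}.png"
done
sed -i 's/\.so/.png/g' *.md
```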

Like the Good Book says, pride goeth before destruction.

I had gotten lucky with the marketing text: all the images were PNGs and, despite being embedded in different docx files, each image had a unique id/filename. Not so with the statistics text! I would have to make my script a bit more robust.

At first I tried using the magic bytes directly to detect the type of the extracted file, but that felt overly hacky and complicated. Since I was already using Linux, why not let file do the hard work for me?
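
A one-liner shows the idea (using the same example file as above):

```bash
# file reads the magic bytes and reports the real MIME type
file -b --mime-type media/rId20.so   # prints: image/jpeg
```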

I also needed to keep files from being overwritten. The problem was that, by default, Pandoc was extracting all of the images to the media subdirectory. So I needed to use --extract-media to tell Pandoc to use a different directory for each docx file.

Finally, now that I was dealing with JPEGs as well as PNGs, I needed to check each file’s type and rewrite the markdown source appropriately.

Having resolved to experiment more with “vibe coding”, this seemed like a nice, small problem to hand over to ChatGPT. Below is the prompt I gave ChatGPT (including my typo) and the results I got back. If you are one of my students and this looks a lot like my homework instructions in Spreadsheet Fundamentals or Business Information Systems, this is not a coincidence! It turns out that learning to write clear homework instructions is great training for prompting AIs…

I'm processing multiple docx files, and some of them use the same filename for different images. Right now, this results in same images being overwritten. I need to modify my script below to handle this case:

```bash
#!/usr/bin/env bash

set -x

# Convert all DOCX files to Markdown
for f in *.docx; do
  [ -e "$f" ] || continue
  base="${f%.docx}"

  pandoc -f docx -t markdown-grid_tables "$f" -o "$base.md" --extract-media=.
done

# Recursively rename .so files based on detected image type using `file`
find ./media -type f -name '*.so' | while IFS= read -r f; do
  base="${f%.so}"

  # Identify file type
  type=$(file -b --mime-type "$f")

  case "$type" in
    image/png)
      mv -- "$f" "$base.png"
      ;;
    image/jpeg)
      mv -- "$f" "$base.jpg"
      ;;
    image/gif)
      mv -- "$f" "$base.gif"
      ;;
    *)
      echo "Unrecognized or unsupported file type ($type): $f"
      ;;
  esac
done

# change filename extensions for the images in the Markdown source 
for f in *.md; do
  [ -e "$f" ] || continue
  sed -i 's/\.so/.png/g' "$f"
  sed -i 's!\./media/!media/!g' "$f"
done

# Convert all Markdown files to HTML5
for f in *.md; do
  [ -e "$f" ] || continue
  base="${f%.md}"

  pandoc -f markdown -t html5 "$f" -o "$base.html" --ascii --mathml
done
```

Please create a new script with the following modifications:

1. When converting the the docx files to markdown using Pandoc, extract the media to a subdirectory based on the filename. For example, if converting `1-introduction.docx`, use `--extract-media=1-introduction`.

2. When renaming the .so files, search inside the subdirectories to find files and rename them there. For example, the images for `1-introduction.docx` will be inside the subdirectory `1-introduction/media`. You should still detect the file type and rename the file appropriately.

3. When changing the filename extensions for the images, we need to know whether to change the extension from `.so` to `.png`, `.jpg`, or `.gif`. Therefore, use the base of the filename to check the filename from the previous step and change the markdown source appropriately. For example, if the markdown source inside the file `1-introduction.md` references the file `1-introduction/media/rId20.so`, you should check for `1-introduction/media/rId20.png`, `1-introduction/media/rId20.jpg`, and `1-introduction/media/rId20.gif` until you find an existing file and then rename to match the file you found.

Please make all the above changes and give me the new script.

Here’s the script that GPT produced:

#!/usr/bin/env bash

set -euo pipefail
set -x

############################################
# 1. Convert DOCX → Markdown, extract media
#    into per-document subdirectories
############################################

for f in *.docx; do
  [ -e "$f" ] || continue
  base="${f%.docx}"

  pandoc \
    -f docx \
    -t markdown-grid_tables \
    "$f" \
    -o "$base.md" \
    --extract-media="$base"
done

############################################
# 2. Rename .so files based on detected type
#    (PNG / JPEG / GIF), inside subdirectories
############################################

find . -type f -path '*/media/*.so' | while IFS= read -r f; do
  base="${f%.so}"
  type=$(file -b --mime-type "$f")

  case "$type" in
    image/png)
      mv -- "$f" "$base.png"
      ;;
    image/jpeg)
      mv -- "$f" "$base.jpg"
      ;;
    image/gif)
      mv -- "$f" "$base.gif"
      ;;
    *)
      echo "Unrecognized or unsupported file type ($type): $f" >&2
      ;;
  esac
done

############################################
# 3. Update Markdown image references
#    by resolving the actual renamed file
############################################

for md in *.md; do
  [ -e "$md" ] || continue

  docbase="${md%.md}"

  # Process each referenced .so file found in the markdown
  grep -oE "${docbase}/media/[^ )]+\.so" "$md" | sort -u | while read -r ref; do
    stem="${ref%.so}"

    if [ -f "${stem}.png" ]; then
      sed -i "s|${ref}|${stem}.png|g" "$md"
    elif [ -f "${stem}.jpg" ]; then
      sed -i "s|${ref}|${stem}.jpg|g" "$md"
    elif [ -f "${stem}.gif" ]; then
      sed -i "s|${ref}|${stem}.gif|g" "$md"
    else
      echo "Warning: no renamed image found for ${ref} in ${md}" >&2
    fi
  done
done

############################################
# 4. Convert Markdown → HTML5
############################################

for f in *.md; do
  [ -e "$f" ] || continue
  base="${f%.md}"

  pandoc \
    -f markdown \
    -t html5 \
    "$f" \
    -o "$base.html" \
    --ascii \
    --mathml
done

The only issue I ran into with the script above is that the pipefail option doesn’t seem to be supported when running under WSL, even though I double-checked and the shell does seem to be bash. (Bash has supported pipefail for a long time, so my best guess is that the script was actually being interpreted by another shell, like dash via sh, but I haven’t chased that down yet.)

Now I had a (reasonably-well) working solution and was able to move on to packaging the content for importing into the LMS. But that’s a story for another post!

I’ll still want to revisit this script to fix any remaining issues; add the exact credit requested by OpenStax; and improve the formatting, but I feel like I’m 80–90% of the way there. One change that might improve things a great deal is going back to embedding the images in the HTML instead of referencing external files. (All those subdirectories and files really clutter up the LMS file storage and slow down the importing of the package.) That will only work if the total amount of data is small enough, but each page is a short section with only a few images. If that doesn’t work, another approach would be to use the docx filename to construct a unique image filename instead of creating all those subdirectories.
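
If I do try embedding again, the change would be small, something like this (untested for these files, but the renaming should have fixed the original MIME-type problem that broke it):

```bash
# Possible future tweak: with proper extensions in place, --embed-resources
# should be able to inline the images as data URIs
pandoc -f markdown -t html5 "$f" -o "$base.html" --ascii --mathml --embed-resources
```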

Using AI

This seems like a good place to also note that even the latest AI models are still prone to confidently giving bad information. While trying to figure out why I was getting those .so files, I asked ChatGPT for some information about the docx format (formally, OOXML), and it confidently told me: When you see files extracted from a .docx that are named with a .so extension, that extension did not exist inside the DOCX. It is being introduced by the tool performing the extraction or conversion. If this had been my first rodeo, I might have believed that. Instead, I gave it a sample file as proof that it was wrong, and got back the following:

Thank you for the counterexample. You are correct, and your attached file is decisive.

After inspecting the DOCX you provided, the earlier claim that "DOCX never stores images with a .so extension" is incorrect. What you are observing is valid OOXML, and Microsoft Word is behaving as designed.

Below is the precise explanation.

This new explanation was more on-target.

Nor was this the only time I had to correct ChatGPT.

Despite this, using AI (ChatGPT and Perplexity, in this case) did make me more productive. Because I was prompting in a targeted way, working on one small chunk of the problem at a time, I was able to steer it back on course quickly and have it provide the information I needed or (as above) write code that I could have written myself, but that would have taken me (a dilettante programmer) longer to get syntactically right.

I would summarize the current state of “vibe coding” as something like “programming without the coding”. Even for a small script like the one above, getting good results still requires understanding the problem and designing a strategy to solve it. But all the fiddly bits? The ones where I would normally look at man pages or other documentation to figure out why I’m getting syntax errors from some command that I rarely use? Yeah, ChatGPT took care of all that for me. When it works, it’s liberating, and I can see why some people are very excited about programming with AI.

Next?

In part 2, I’ll talk about packaging the files for import into the LMS. Until then, in lieu of a comments section, feel free to DM me on the twitters with questions and suggestions!