Reading the TL2 PAK file

I’m gradually continuing to make progress with FNIStash, gradually being the key word. It’s pretty tough to learn a new language, decode a file format, and plan a wedding at the same time.

Lately, I’ve been working on a Haskell module to decode the Torchlight 2 PAK file. The PAK file is a single archive file that contains all the configuration and asset files for the game (well, at least most of them). I’m hoping to read this PAK file and decode its contents so I can, for instance, grab the bitmaps for the various inventory items and use them in the GUI for FNIStash.

In writing the PAK module, I came tried a few different strategies for parsing the binary data, with varying levels of success. I won’t describe the PAK format here because it is already well described in the readme file included in CUE’s Torchlight II packer.

Objective

The PAK.MAN file associates a file path to a particular part of the PAK file. What I’d like to do is create a lookup table of all the files archived in the PAK file. The key would be the file path, and the value would be the right data out of the PAK file. I call this a PAKEntry. So the mapping would look something like

type PAKFiles = M.Map FilePath PAKEntry

with PAK entries being parsed with the following Get operation

getPAKEntry :: Get PAKEntry
getPAKEntry = do
    hdr <- getWord32le
    decSize <- getWord32le
    encSize <- getWord32le
    encData <- getLazyByteString $ fromIntegral encSize
    return $ PAKEntry hdr decSize encSize encData

Attempt 1 – Using lazy Get

My first attempt used the lazy binary Get monad. This was also lazy on my part. My thinking here was that, since I already head a Get monad operation coded up that could parse out a PAKEntry, I could just add a line at the beginning of that operation that skips to the right part of the file. Something like

(skip . fromIntegral) (offset-4)

with the offset being passed in as a function argument (so the type signature would be Word32 -> Get PAKEntry). This worked, but was horrendously slow. What this does is read all the data of the PAK file into memory up to the offset (approximately), and then reads in as much more as it needs to parse out the PAK entry. For parsing entries near the end of the 800 MB file, this took a long time. I used this function to construct the table:

readPAKFiles :: FilePath -> FilePath -> IO PAKFiles
readPAKFiles manFile pakFile = do
    man <- readPAKMAN manFile
    pak <- BS.readFile pakFile
    let fileList = pakFileList man
        offsetList = pakFileOffsets man
        fileOffsetList = L.zip fileList offsetList
        f offset = flip runGet pak ((getPAKEntry . fromIntegral) offset)
        mapList = L.zip fileList (L.map f offsetList) :: [(FilePath, PAKEntry)]
    return $ M.fromList mapList

Attempt 2 – Using strict Get

Attempt 2 used the strict Get monad and strict bytestrings. The code is nearly identical to attempt 1. Here, the problem was that the BS.readFile function read in the entire contents of the PAK file to memory, then skipped to the right location of the PAK entry (instead of streaming the file into memory as necessary as it did with lazy strings). This was *much* faster, but required a huge 800 MB of memory to be read in just to parse out a tiny 26k PNG file (the file I was using for testing.).

Attempt 3 – Handle and lazy Get

Check out the type signature of the PAKFiles type. It maps a FilePath to a PAKEntry. No where in that definition is IO mentioned. So PAKFiles cannot do any IO, by definition. The previous two attempts have PAKFiles defined as a pure result of parsing binary data read through IO. What I really needed is for PAKFiles to have this type signature:

type PAKFiles = M.Map FilePath (IO PAKEntry)

Now, the result of a lookup operation is an IO action that retrieves a PAKEntry.

Reading the PAK entry lazily is the behavior I wanted to keep because I didn’t know how much data I needed to read in until after I started reading the entry. But I didn’t want to lazily read in and throw away the data in order to skip to the right part in the file. My next attempt used handles:

readPAKFiles :: FilePath -> FilePath -> IO PAKFiles
readPAKFiles manFile pakFile = do
    man <- readPAKMAN manFile
    let fileList = pakFileList man
        offsetList = pakFileOffsets man
        fileOffsetList = L.zip fileList offsetList
        f offset = do
            h <- openBinaryFile pakFile ReadMode
            hSeek h AbsoluteSeek (fromIntegral offset - 4)
            pak <- BS.hGetContents h
            let parsedEntry = flip runGet pak getPAKEntry
            hClose h
            return parsedEntry
        mapList = L.zip fileList (L.map f offsetList) :: [(FilePath, IO PAKEntry)]
    return $ M.fromList mapList

There’s a problem with this, though. I kept getting errors that the handle was closed before I tried to read it!

Attempt 4 – (Mostly) success with $!

The problem with attempt 3 is that runGet is executed lazily, but hOpen and hClose are not. When runGet is finally performed, the handle is already closed. This is easily fixed with the $! operator, which strict evaluates its argument:

readPAKFiles :: FilePath -> FilePath -> IO PAKFiles
readPAKFiles manFile pakFile = do
    man <- readPAKMAN manFile
    let fileList = pakFileList man
        offsetList = pakFileOffsets man
        f offset = do
            withBinaryFile pakFile ReadMode (\h -> do
                hSeek h AbsoluteSeek (fromIntegral offset - 4)
                pak <- BS.hGetContents h
                return $! (flip runGet pak getPAKEntry))
        mapList = L.zip fileList (L.map f offsetList) :: [(FilePath, IO PAKEntry)]
    return $ M.fromList mapList

This version of the operation is fast and uses a low amount of memory. By “low”, I mean 95 MB to parse out a 26k file. Something is wrong there, which is my next thing to investigate. Maybe that will make for another post.

– Update 01/23/2013 –

This post last ended with my creating a lookup table that generated IO actions that returned PAKEntry records. This turns out to be a horrible idea. Every time a lookup is made, IO has to be run. In order to do this, you need to pollute huge parts of your program with the IO monad. Instead, I reformulated this to be pure and more forward looking – you give the function a clue of what content you want in the table (such as a file path prefix), and it will read all of the necessary files into memory right then and there. After that, lookups can be done without the IO monad.

fluffynukeit

Occasional bursts of productivity