M0UNTAIN 0F C0DE

100 Million Row Challenge

Brendt recently issued a challenge to the PHP community:

The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible.

The full rules and prizes can be found here.

This was somewhat fortunate timing for me as just the day before I had read this excellent article about approaching a similar problem.

I didn't expect to win any prizes, but that wasn't why I wanted to give it a go.

First Pass

The first approach was pretty simple: keep looping while there are bytes left to read, and use stream_get_line to extract the parts of each line. It was immediately obvious that fully parsing the date and the URL with the likes of parse_url or DateTime wasn't required and would add a lot of overhead.

public function parse(string $inputPath, string $outputPath): void
{
    $fp = fopen($inputPath, 'r');
    $bytes = filesize($inputPath);

    $data = [];

    while (ftell($fp) < $bytes) {
        fseek($fp, strlen('https://stitcher.io'), SEEK_CUR);
        $path = stream_get_line($fp, 1000, ",");
        $date = stream_get_line($fp, 1000, "T");
        stream_get_line($fp, 1000, "\n");

        $data[$path][$date] ??= 0;
        $data[$path][$date]++;
    }

    foreach ($data as &$values) {
        ksort($values);
    }

    file_put_contents($outputPath, json_encode($data, JSON_PRETTY_PRINT));
}
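For reference, each row looks roughly like this. The sample below is my own reconstruction from the parsing offsets; the exact timestamp format is an assumption, not taken from the challenge rules.

```php
<?php
// A hypothetical sample row, reconstructed from the parsing offsets above.
$line = 'https://stitcher.io/blog/some-post,2023-06-15T12:34:56+00:00';

// strlen('https://stitcher.io') === 19, so the path starts at offset 19,
// runs until the comma, and the date is the 10 characters after the comma.
$commaPos = strpos($line, ',', 19);
$path = substr($line, 19, $commaPos - 19);
$date = substr($line, $commaPos + 1, 10);

assert($path === '/blog/some-post');
assert($date === '2023-06-15');
```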

Inlining Static strlen Calls

I thought that PHP inlined calls to strlen with constant strings, but it turned out to be faster to just hardcode the number. Probably because OPcache is disabled on the CLI by default?

Before:

fseek($fp, strlen('https://stitcher.io'), SEEK_CUR);

After:

fseek($fp, 19, SEEK_CUR);

A Single Line Read Beats Multiple stream_get_line Calls

It turns out that string manipulation on a whole row is faster than calling stream_get_line several times to extract each part.

Before:

while (ftell($fp) < $bytes) {
    fseek($fp, 19, SEEK_CUR);
    $path = stream_get_line($fp, 1000, ",");
    $date = stream_get_line($fp, 1000, "T");
    stream_get_line($fp, 1000, "\n");
    
    //...
}

After:

while (($line = stream_get_line($fp, 4096, "\n")) !== false) {
    $commaPos = strpos($line, ',', 19);
    $path = substr($line, 19, $commaPos - 19);
    $date = substr($line, $commaPos + 1, 10);
    
    //...
}

Stream Buffer Size

I suspect this is platform-dependent, but manually setting the read buffer size increased the throughput by a significant percentage.

stream_set_read_buffer($fp, 65536);

I did a benchmark to find the fastest buffer size:

foreach ([16384, 32768, 65536, 262144, 1048576, 4194304] as $size) {
    $start = microtime(true);
    $fp = fopen($file, 'rb');
    stream_set_read_buffer($fp, $size);
    while (!feof($fp)) {
        $data = fread($fp, $size);
    }
    fclose($fp);
    printf("Buffer %7d: %.3fs\n", $size, microtime(true) - $start);
}

Time For Some Benchmarking

I did some benchmarking at this point: I needed to know where to focus, and I couldn't see where further improvement would come from.

$fp = fopen($inputPath, 'r');
stream_set_read_buffer($fp, 65536);

$data = [];

$str = 0;
$increments = 0;
$loop = -microtime(true);
while (($line = stream_get_line($fp, 4096, "\n")) !== false) {
    $strs = -microtime(true);
    $commaPos = strpos($line, ',', 19);
    $path = substr($line, 19, $commaPos - 19);
    $date = substr($line, $commaPos + 1, 10);
    $strs += microtime(true);

    $str += $strs;

    $increment = -microtime(true);
    $data[$path][$date] ??= 0;
    $data[$path][$date]++;
    $increment += microtime(true);

    $increments += $increment;
}

$loop += microtime(true);

$sort = -microtime(true);
foreach ($data as &$values) {
    ksort($values);
}
$sort += microtime(true);

$write = -microtime(true);
file_put_contents($outputPath, json_encode($data, JSON_PRETTY_PRINT));
$write += microtime(true);

var_dump($loop, $str, $increments, $sort, $write);
float(7.383) // loop - this includes the benchmarking ops themselves; without those it's closer to ~5.5s
float(1.347) // string ops
float(4.276) // increment ops
float(0.210) // sort
float(0.080) // write

isset() Beats ??=

The majority of the time was being spent on this block:

$data[$path][$date] ??= 0;
$data[$path][$date]++;

Thinking about it, this is shorthand for three separate operations: an isset check, an assignment, and an increment:

if (!isset($data[$path][$date])) {
    $data[$path][$date] = 0;
}
$data[$path][$date]++;

This can easily be reduced to two operations:

if (isset($data[$path][$date])) {
    $data[$path][$date]++;
} else {
    $data[$path][$date] = 1;
}
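A quick micro-benchmark (my own, with synthetic keys rather than the challenge data) confirms the isset/else version comes out ahead while producing identical counts:

```php
<?php
// Hypothetical micro-benchmark comparing the two counting strategies
// over the same synthetic keys.
$keys = [];
for ($i = 0; $i < 1_000_000; $i++) {
    $keys[] = 'k' . ($i % 1000);
}

// Strategy 1: null-coalescing assignment plus increment.
$a = [];
$t = -microtime(true);
foreach ($keys as $k) {
    $a[$k] ??= 0;
    $a[$k]++;
}
$t += microtime(true);
printf("??= + increment: %.3fs\n", $t);

// Strategy 2: explicit isset branch.
$b = [];
$t = -microtime(true);
foreach ($keys as $k) {
    if (isset($b[$k])) {
        $b[$k]++;
    } else {
        $b[$k] = 1;
    }
}
$t += microtime(true);
printf("isset/else:      %.3fs\n", $t);

// Both strategies must produce identical counts.
assert($a === $b);
```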

Nested Array Lookups Are Expensive

This one was a surprise. The benchmarks now showed that isset($data[$path][$date]) was the bottleneck.

It turns out that querying nested arrays is an expensive operation. It is much faster to build a single-dimensional array with a composite "path,date" key and re-assemble it into a multidimensional array later.
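A sketch of the comparison on synthetic data (my own construction, not the challenge file):

```php
<?php
// Hypothetical micro-benchmark: counting with a nested array versus a flat
// array keyed on "path,date".
$rows = [];
for ($i = 0; $i < 500_000; $i++) {
    $rows[] = ['/post-' . ($i % 200), '2023-0' . (($i % 9) + 1) . '-01'];
}

// Nested: two array lookups per row.
$nested = [];
$t = -microtime(true);
foreach ($rows as [$path, $date]) {
    if (isset($nested[$path][$date])) {
        $nested[$path][$date]++;
    } else {
        $nested[$path][$date] = 1;
    }
}
$t += microtime(true);
printf("nested: %.3fs\n", $t);

// Flat: one lookup per row on a composite key.
$flat = [];
$t = -microtime(true);
foreach ($rows as [$path, $date]) {
    $key = $path . ',' . $date;
    if (isset($flat[$key])) {
        $flat[$key]++;
    } else {
        $flat[$key] = 1;
    }
}
$t += microtime(true);
printf("flat:   %.3fs\n", $t);

// Both approaches count every row exactly once.
assert(array_sum($flat) === count($rows));
```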

$fp = fopen($inputPath, 'r');
stream_set_read_buffer($fp, 65536);

$data = [];

while (($line = stream_get_line($fp, 4096, "\n")) !== false) {
    $commaPos = strpos($line, ',', 19);
    $key = substr($line, 19, $commaPos + 11 - 19);

    if (isset($data[$key])) {
        $data[$key]++;
    } else {
        $data[$key] = 1;
    }
}

$nestedData = [];
foreach ($data as $key => $count) {
    [$path, $date] = explode(',', $key, 2);
    $nestedData[$path][$date] = $count;
}

foreach ($nestedData as &$values) {
    ksort($values);
}

file_put_contents($outputPath, json_encode($nestedData, JSON_PRETTY_PRINT));

explode Is Slower Than substr

Another minor improvement was to replace explode with substr calls. We can do this because the length of the date is fixed. I think it also helps that it eliminates two variable assignments.

Before:

foreach ($data as $key => $count) {
    [$path, $date] = explode(',', $key, 2);
    $nestedData[$path][$date] = $count;
}

After:

foreach ($data as $key => $count) {
    $nestedData[substr($key, 0, -11)][substr($key, -10)] = $count;
}
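The slicing relies on the date always being exactly 10 characters ("YYYY-MM-DD"), so the two forms are interchangeable. A quick sanity check, on an example key of my own:

```php
<?php
// The composite key is "<path>,<YYYY-MM-DD>", so the date is always the last
// 10 characters and ",<date>" is always the last 11.
$key = '/blog/some-post,2023-06-15';

[$path, $date] = explode(',', $key, 2);

assert($path === substr($key, 0, -11));
assert($date === substr($key, -10));
```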

Final Benchmark

$fp = fopen($inputPath, 'r');
stream_set_read_buffer($fp, 65536);

$data = [];

$increments = 0;
$strs = 0;
$lineReads = 0;
$loop = -microtime(true);
while (true) {
    $lineRead = -microtime(true);
    $line = stream_get_line($fp, 4096, "\n");
    $lineRead += microtime(true);

    $lineReads += $lineRead;

    if ($line === false) {
        break;
    }

    $str = -microtime(true);
    $commaPos = strpos($line, ',', 19);
    $key = substr($line, 19, $commaPos + 11 - 19);
    $str += microtime(true);

    $strs += $str;

    $increment = -microtime(true);
    if (isset($data[$key])) {
        $data[$key]++;
    } else {
        $data[$key] = 1;
    }
    $increment += microtime(true);

    $increments += $increment;
}
$loop += microtime(true);

$nesting = -microtime(true);
$nestedData = [];
foreach ($data as $key => $count) {
    $nestedData[substr($key, 0, -11)][substr($key, -10)] = $count;
}
$nesting += microtime(true);

$sorting = -microtime(true);
foreach ($nestedData as &$values) {
    ksort($values);
}
$sorting += microtime(true);

$writing = -microtime(true);
file_put_contents($outputPath, json_encode($nestedData, JSON_PRETTY_PRINT));
$writing += microtime(true);

var_dump($loop, $lineReads, $strs, $increments, $nesting, $sorting, $writing);
float(7.213) // loop - including benchmarking overhead
float(1.068) // Reading line from disk
float(1.066) // Extracting date/path strings
float(3.703) // Incrementing array
float(0.133) // creating nested array
float(0.204) // sorting array
float(0.081) // writing JSON

Parallel Processing

I think the next big performance jump would come from parallel processing. There are two CPUs available.

I experimented with pcntl_fork, but the overhead of combining the results from the forked processes made it slower than a single process. I'm sure I'm doing something wrong here; I suspect I need more than two workers. Another time.

<?php

namespace App;

final class Parser
{
    public function parse(string $inputPath, string $outputPath): void
    {
        $fp = fopen($inputPath, 'r');
        $bytes = filesize($inputPath);

        // get midpoint
        fseek($fp, (int)($bytes / 2));
        stream_get_line($fp, 1000, "\n");
        $midPointLineStartByte = ftell($fp);
        rewind($fp);

        $childDataOutputPath = tempnam(sys_get_temp_dir(), 'parser_');

        $pid = pcntl_fork();

        if ($pid === 0) {
            $data = $this->parseChunk($inputPath, 0, $midPointLineStartByte);
            file_put_contents($childDataOutputPath, serialize($data));
            exit(0);
        } else {
            $data = $this->parseChunk($inputPath, $midPointLineStartByte, $bytes);
        }

        pcntl_waitpid($pid, $status);

        $childData = unserialize(file_get_contents($childDataOutputPath));

        foreach ($childData as $path => $dates) {
            foreach ($dates as $date => $count) {
                $data[$path][$date] ??= 0;
                $data[$path][$date] += $count;
            }
        }

        foreach ($data as &$values) {
            ksort($values);
        }

        file_put_contents($outputPath, json_encode($data, JSON_PRETTY_PRINT));
    }

    private function parseChunk(string $inputPath, int $start, int $end): array
    {
        $fp = fopen($inputPath, 'r');
        fseek($fp, $start);

        $data = [];

        while (ftell($fp) < $end) {
            fseek($fp, strlen('https://stitcher.io'), SEEK_CUR);
            $path = stream_get_line($fp, 1000, ",");
            $date = stream_get_line($fp, 1000, "T");
            stream_get_line($fp, 1000, "\n");

            $data[$path][$date] ??= 0;
            $data[$path][$date]++;
        }

        fclose($fp);

        return $data;
    }
}
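For what it's worth, here is how I'd sketch generalizing the split to N workers (hypothetical, untested in the challenge environment): compute N+1 offsets that all fall on line boundaries, then fork one child per chunk and merge the serialized partial counts, exactly as the two-process version does.

```php
<?php
// Hypothetical helper (my own sketch, not part of the submission): split a
// file into $workers chunks whose boundaries all fall on line starts.
function chunkOffsets(string $path, int $workers): array
{
    $bytes = filesize($path);
    $fp = fopen($path, 'r');

    $offsets = [0];
    for ($i = 1; $i < $workers; $i++) {
        // Jump to an approximate split point, then advance past the next
        // "\n" so the chunk boundary falls on the start of a line.
        fseek($fp, (int)($bytes * $i / $workers));
        stream_get_line($fp, 1000, "\n");
        $offsets[] = ftell($fp);
    }
    fclose($fp);

    $offsets[] = $bytes;

    return $offsets;
}
```

Each worker would then run parseChunk($inputPath, $offsets[$i], $offsets[$i + 1]) in a forked child, writing its partial counts to a temp file for the parent to merge.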