Parquet.Net 4.1.3 vs ParquetSharp 8.0.0 Performance

The official ParquetSharp docs claim that it's about 8 times faster than Parquet.Net. That's plausible, as ParquetSharp is a wrapper around the C++ implementation. But I wanted to check for myself.

Benchmarking was done properly (code below): data was prepared before the benchmarks ran, and both libraries had an identical job to do - reading and writing files with different data types.

Tested on a Windows 11 rig (Surface Book 3, i7). Here are the raw numbers.

Results

| Method       | DataType | DataSize | Mode  | Mean          | Error          | StdDev        | Gen0       | Gen1       | Gen2      | Allocated    |
|--------------|----------|----------|-------|---------------|----------------|---------------|------------|------------|-----------|--------------|
| ParquetNet   | float    | 10       | read  | 73.88 μs      | 15.766 μs      | 0.864 μs      | 2.8076     | -          | -         | 11.49 KB     |
| ParquetSharp | float    | 10       | read  | 100.49 μs     | 20.178 μs      | 1.106 μs      | 9.0332     | 1.7090     | -         | 37.23 KB     |
| ParquetNet   | float    | 10       | write | 396.30 μs     | 126.413 μs     | 6.929 μs      | 2.4414     | -          | -         | 11.63 KB     |
| ParquetSharp | float    | 10       | write | 491.33 μs     | 259.184 μs     | 14.207 μs     | 4.8828     | -          | -         | 20.63 KB     |
| ParquetNet   | float    | 100      | read  | 75.12 μs      | 17.052 μs      | 0.935 μs      | 2.9297     | -          | -         | 12.19 KB     |
| ParquetSharp | float    | 100      | read  | 100.42 μs     | 4.586 μs       | 0.251 μs      | 9.1553     | 1.8311     | -         | 37.58 KB     |
| ParquetNet   | float    | 100      | write | 534.55 μs     | 566.419 μs     | 31.047 μs     | 3.4180     | -          | -         | 14 KB        |
| ParquetSharp | float    | 100      | write | 624.27 μs     | 542.110 μs     | 29.715 μs     | 4.8828     | -          | -         | 20.63 KB     |
| ParquetNet   | float    | 1000     | read  | 173.66 μs     | 14.614 μs      | 0.801 μs      | 5.6152     | -          | -         | 22.15 KB     |
| ParquetSharp | float    | 1000     | read  | 101.96 μs     | 9.078 μs       | 0.498 μs      | 10.0098    | 1.9531     | -         | 41.09 KB     |
| ParquetNet   | float    | 1000     | write | 576.80 μs     | 70.744 μs      | 3.878 μs      | 9.7656     | -          | -         | 40.15 KB     |
| ParquetSharp | float    | 1000     | write | 592.69 μs     | 562.047 μs     | 30.808 μs     | 4.8828     | -          | -         | 20.63 KB     |
| ParquetNet   | float    | 1000000  | read  | 12,020.33 μs  | 23,014.402 μs  | 1,261.497 μs  | 156.2500   | 156.2500   | 156.2500  | 7827.37 KB   |
| ParquetSharp | float    | 1000000  | read  | 2,263.07 μs   | 935.326 μs     | 51.268 μs     | 125.0000   | 117.1875   | 117.1875  | 3944.02 KB   |
| ParquetNet   | float    | 1000000  | write | 40,940.85 μs  | 26,413.129 μs  | 1,447.793 μs  | 692.3077   | 692.3077   | 692.3077  | 30277.95 KB  |
| ParquetSharp | float    | 1000000  | write | 45,095.70 μs  | 135,818.351 μs | 7,444.662 μs  | -          | -          | -         | 20.75 KB     |
| ParquetNet   | int      | 10       | read  | 90.33 μs      | 163.148 μs     | 8.943 μs      | 2.6855     | -          | -         | 11.49 KB     |
| ParquetSharp | int      | 10       | read  | 101.88 μs     | 22.113 μs      | 1.212 μs      | 9.1553     | 1.8311     | -         | 37.6 KB      |
| ParquetNet   | int      | 10       | write | 525.79 μs     | 65.553 μs      | 3.593 μs      | 2.4414     | -          | -         | 11.63 KB     |
| ParquetSharp | int      | 10       | write | 553.13 μs     | 86.676 μs      | 4.751 μs      | 4.8828     | -          | -         | 20.88 KB     |
| ParquetNet   | int      | 100      | read  | 75.46 μs      | 9.007 μs       | 0.494 μs      | 2.9297     | -          | -         | 12.19 KB     |
| ParquetSharp | int      | 100      | read  | 112.53 μs     | 7.467 μs       | 0.409 μs      | 9.1553     | 1.8311     | -         | 37.95 KB     |
| ParquetNet   | int      | 100      | write | 539.13 μs     | 894.439 μs     | 49.027 μs     | 3.4180     | -          | -         | 14 KB        |
| ParquetSharp | int      | 100      | write | 640.44 μs     | 203.358 μs     | 11.147 μs     | 4.8828     | -          | -         | 20.88 KB     |
| ParquetNet   | int      | 1000     | read  | 175.95 μs     | 30.184 μs      | 1.655 μs      | 5.6152     | -          | -         | 22.15 KB     |
| ParquetSharp | int      | 1000     | read  | 106.26 μs     | 16.655 μs      | 0.913 μs      | 10.1318    | 1.9531     | -         | 41.47 KB     |
| ParquetNet   | int      | 1000     | write | 600.46 μs     | 203.480 μs     | 11.153 μs     | 9.7656     | -          | -         | 40.15 KB     |
| ParquetSharp | int      | 1000     | write | 747.07 μs     | 300.856 μs     | 16.491 μs     | 4.8828     | -          | -         | 20.88 KB     |
| ParquetNet   | int      | 1000000  | read  | 11,774.15 μs  | 7,818.750 μs   | 428.572 μs    | 156.2500   | 156.2500   | 156.2500  | 7827.9 KB    |
| ParquetSharp | int      | 1000000  | read  | 2,863.95 μs   | 559.571 μs     | 30.672 μs     | 117.1875   | 109.3750   | 109.3750  | 3944.37 KB   |
| ParquetNet   | int      | 1000000  | write | 37,166.73 μs  | 17,563.943 μs  | 962.739 μs    | 714.2857   | 714.2857   | 714.2857  | 30277.97 KB  |
| ParquetSharp | int      | 1000000  | write | 24,173.06 μs  | 13,369.745 μs  | 732.841 μs    | -          | -          | -         | 20.94 KB     |
| ParquetNet   | str      | 10       | read  | 82.11 μs      | 40.593 μs      | 2.225 μs      | 3.9063     | -          | -         | 16.28 KB     |
| ParquetSharp | str      | 10       | read  | 122.56 μs     | 22.451 μs      | 1.231 μs      | 26.8555    | 6.5918     | -         | 112.05 KB    |
| ParquetNet   | str      | 10       | write | 585.58 μs     | 153.079 μs     | 8.391 μs      | 3.9063     | -          | -         | 17.11 KB     |
| ParquetSharp | str      | 10       | write | 619.85 μs     | 421.429 μs     | 23.100 μs     | 19.5313    | 3.9063     | -         | 81.14 KB     |
| ParquetNet   | str      | 100      | read  | 197.51 μs     | 102.508 μs     | 5.619 μs      | 13.9160    | -          | -         | 50.01 KB     |
| ParquetSharp | str      | 100      | read  | 130.14 μs     | 20.777 μs      | 1.139 μs      | 32.2266    | 10.4980    | -         | 132.45 KB    |
| ParquetNet   | str      | 100      | write | 2,023.54 μs   | 2,064.824 μs   | 113.180 μs    | 13.6719    | -          | -         | 56.64 KB     |
| ParquetSharp | str      | 100      | write | 795.58 μs     | 2,056.523 μs   | 112.725 μs    | 20.5078    | 3.9063     | -         | 87.21 KB     |
| ParquetNet   | str      | 1000     | read  | 514.14 μs     | 537.780 μs     | 29.478 μs     | 64.4531    | 32.2266    | 32.2266   | 355.95 KB    |
| ParquetSharp | str      | 1000     | read  | 236.97 μs     | 298.893 μs     | 16.383 μs     | 71.2891    | 23.4375    | -         | 336.35 KB    |
| ParquetNet   | str      | 1000     | write | 1,378.53 μs   | 1,838.067 μs   | 100.751 μs    | 72.2656    | 72.2656    | 72.2656   | 395.21 KB    |
| ParquetSharp | str      | 1000     | write | 1,005.94 μs   | 874.443 μs     | 47.931 μs     | 48.8281    | 9.7656     | -         | 206.29 KB    |
| ParquetNet   | str      | 1000000  | read  | 373,941.10 μs | 176,271.747 μs | 9,662.049 μs  | 36000.0000 | 18000.0000 | 1000.0000 | 339864.13 KB |
| ParquetSharp | str      | 1000000  | read  | 349,504.53 μs | 192,279.577 μs | 10,539.492 μs | 36000.0000 | 18000.0000 | 1000.0000 | 226691.53 KB |
| ParquetNet   | str      | 1000000  | write | 623,253.27 μs | 369,838.885 μs | 20,272.117 μs | -          | -          | -         | 390525.34 KB |
| ParquetSharp | str      | 1000000  | write | 262,019.10 μs | 247,550.530 μs | 13,569.080 μs | 11500.0000 | 1000.0000  | -         | 110924.43 KB |

Summary

On smaller datasets Parquet.Net is actually 10-30% faster, which is explainable given how well that code path is written. On the large dataset, i.e. 1 million rows, the results for Parquet.Net are:

  • integer type: reading is 4 times slower than ParquetSharp; writing is 1.5 times slower.
  • float type: reading is 5 times slower; writing is almost identical.
  • string type: reading is almost identical; writing is about 2.4 times slower.
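The ratios above fall straight out of the 1,000,000-row mean timings in the results table; a quick sanity check (Parquet.Net mean divided by ParquetSharp mean, in μs):

```csharp
using System;

// Speedup ratios for the 1M-row runs, computed from the mean timings (μs)
// in the results table above: Parquet.Net mean / ParquetSharp mean.
double intRead    = 11_774.15 / 2_863.95;     // ≈ 4.1
double intWrite   = 37_166.73 / 24_173.06;    // ≈ 1.5
double floatRead  = 12_020.33 / 2_263.07;     // ≈ 5.3
double floatWrite = 40_940.85 / 45_095.70;    // ≈ 0.9 (almost identical)
double strRead    = 373_941.10 / 349_504.53;  // ≈ 1.1 (almost identical)
double strWrite   = 623_253.27 / 262_019.10;  // ≈ 2.4

Console.WriteLine($"int:   {intRead:F1}x read, {intWrite:F1}x write");
Console.WriteLine($"float: {floatRead:F1}x read, {floatWrite:F1}x write");
Console.WriteLine($"str:   {strRead:F1}x read, {strWrite:F1}x write");
```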

So yes, Parquet.Net 4.1.3 is slower, but not 8 times slower. You also get safe code implemented purely in .NET that works on all platforms, unlike a native wrapper.

I'd watch this space, though: Parquet.Net 4.2 will start bringing performance improvements that may put it on par, or even make it faster. The main reason it is slow today is its reliance on BinaryWriter and BinaryReader for binary encoding, which are not optimised for heavy workloads. Early tests show that eliminating them makes Parquet.Net faster than ParquetSharp, so these improvements may land soon.
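To illustrate the kind of overhead involved (this is a generic sketch of the pattern, not Parquet.Net's actual internals): BinaryWriter pushes one value at a time through virtual calls into the stream, whereas a span-based encoder such as BinaryPrimitives can fill a byte buffer directly and flush it in a single write.

```csharp
using System;
using System.Buffers.Binary;
using System.Diagnostics;
using System.IO;

// Encode 1M ints two ways: one BinaryWriter.Write call per value,
// vs. span-based encoding into a buffer flushed with a single Write.
int[] values = new int[1_000_000];
for (int i = 0; i < values.Length; i++) values[i] = i;

// 1) BinaryWriter: a virtual call per value.
var sw = Stopwatch.StartNew();
using (var ms = new MemoryStream(values.Length * sizeof(int)))
using (var bw = new BinaryWriter(ms)) {
    foreach (int v in values) bw.Write(v);
}
Console.WriteLine($"BinaryWriter:     {sw.ElapsedMilliseconds} ms");

// 2) BinaryPrimitives: encode into a reusable buffer, write once.
sw.Restart();
byte[] buffer = new byte[values.Length * sizeof(int)];
Span<byte> span = buffer;
for (int i = 0; i < values.Length; i++)
    BinaryPrimitives.WriteInt32LittleEndian(span.Slice(i * sizeof(int)), values[i]);
using (var ms = new MemoryStream(buffer.Length))
    ms.Write(buffer, 0, buffer.Length);
Console.WriteLine($"BinaryPrimitives: {sw.ElapsedMilliseconds} ms");
```

Both approaches produce identical little-endian bytes; the difference is purely per-call overhead and buffering.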

Benchmarking Code

#LINQPad optimize+

// NuGet packages: Parquet.Net 4.1.3, ParquetSharp 8.0.0, BenchmarkDotNet
// Namespaces to import: Parquet, Parquet.Data, ParquetSharp,
// BenchmarkDotNet.Attributes, BenchmarkDotNet.Running

void Main()
{
	Util.AutoScrollResults = true;
	
	// generate int and string samples
	
	BenchmarkRunner.Run<ParquetBenches>();
}

[ShortRunJob]
[MarkdownExporter]
[MemoryDiagnoser]
public class ParquetBenches {
    public string ParquetNetFilename;
    public string ParquetSharpFilename;

    [Params("int", "str", "float")]
    public string DataType;

    [Params(10, 100, 1000, 1000000)]
    //[Params(10)]
    public int DataSize;
    
    [Params("write", "read")]
    public string Mode;
    

    private static Random random = new Random();
    public static string RandomString(int length) {
        const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        return new string(Enumerable.Repeat(chars, length)
          .Select(s => s[random.Next(s.Length)]).ToArray());
    }

    private DataColumn _pqnc;
    private Schema _pqns;
    private MemoryStream _pnqMs;

    private Column[] _pqss;
    private object _pqsd;
    private Action<RowGroupWriter> _pqsWriteAction;
    private Action<RowGroupReader, int> _pqsReadAction;

    [GlobalSetup]
    public async Task Setup() {
        switch (DataType) {
            case "int":
                _pqnc = new DataColumn(new DataField<int>("c"), Enumerable.Range(0, DataSize).ToArray());
                _pqss = new Column[] { new Column<int>("c") };
                _pqsd = (int[])_pqnc.Data;
                _pqsWriteAction = w => {
                    using (var colWriter = w.NextColumn().LogicalWriter<int>()) {
                        colWriter.WriteBatch((int[])_pqsd);
                    }
                };
                _pqsReadAction = (r, n) => {
                    var data = r.Column(0).LogicalReader<int>().ReadAll(n);
                };
                break;
            case "str":
                _pqnc = new DataColumn(new DataField<string>("c"), Enumerable.Range(0, DataSize).Select(i => RandomString(100)).ToArray());
                _pqss = new Column[] { new Column<string>("c") };
                _pqsd = (string[])_pqnc.Data;
                _pqsWriteAction = w => {
                    using (var colWriter = w.NextColumn().LogicalWriter<string>()) {
                        colWriter.WriteBatch((string[])_pqsd);
                    }
                };
                _pqsReadAction = (r, n) => {
                    var data = r.Column(0).LogicalReader<string>().ReadAll(n);
                };
                break;
            case "float":
                _pqnc = new DataColumn(
                    new DataField<float>("f"), Enumerable.Range(0, DataSize).Select(i => (float)i).ToArray());
                _pqss = new Column[] { new Column<float>("f") };
                _pqsd = (float[])_pqnc.Data;
                _pqsWriteAction = w => {
                    using (var colWriter = w.NextColumn().LogicalWriter<float>()) {
                        colWriter.WriteBatch((float[])_pqsd);
                    }
                };
                _pqsReadAction = (r, n) => {
                    var data = r.Column(0).LogicalReader<float>().ReadAll(n);
                };

                break;
            case "date":
                _pqnc = new DataColumn(
                    new DataField<DateTimeOffset>("dto"),
                    Enumerable.Range(0, DataSize).Select(i => (DateTimeOffset)DateTime.UtcNow.AddSeconds(i)).ToArray());
                _pqss = new Column[] { new Column<DateTimeOffset>("dto") };
                _pqsd = (DateTimeOffset[])_pqnc.Data;
                _pqsWriteAction = w => {
                    using (var colWriter = w.NextColumn().LogicalWriter<DateTimeOffset>()) {
                        colWriter.WriteBatch((DateTimeOffset[])_pqsd);
                    }
                };
                _pqsReadAction = (r, n) => {
                    var data = r.Column(0).LogicalReader<DateTimeOffset>().ReadAll(n);
                };

                break;


            default:
                throw new NotImplementedException();
        }

        _pqns = new Schema(_pqnc.Field);
        _pnqMs = new MemoryStream(1000);

        ParquetNetFilename = $"c:\\tmp\\parq_net_benchmark_{Mode}_{DataSize}_{DataType}.parquet";
        ParquetSharpFilename = $"c:\\tmp\\parq_sharp_benchmark_{Mode}_{DataSize}_{DataType}.parquet";

        if (Mode == "read") {
            using (Stream fileStream = File.Create(ParquetNetFilename)) {
                using (var writer = await ParquetWriter.CreateAsync(_pqns, fileStream)) {
                    writer.CompressionMethod = CompressionMethod.None;
                    // create a new row group in the file
                    using (ParquetRowGroupWriter groupWriter = writer.CreateRowGroup()) {
                        await groupWriter.WriteColumnAsync(_pqnc);
                    }
                }
            }
        }
    }

    [Benchmark]
    public async Task ParquetNet() {
        if (Mode == "write") {
            using (Stream fileStream = File.Create(ParquetNetFilename)) {
                using (var writer = await ParquetWriter.CreateAsync(_pqns, fileStream)) {
                    writer.CompressionMethod = CompressionMethod.None;
                    // create a new row group in the file
                    using (ParquetRowGroupWriter groupWriter = writer.CreateRowGroup()) {
                        await groupWriter.WriteColumnAsync(_pqnc);
                    }
                }
            }
        } else if(Mode == "read") {
            using(var reader = await ParquetReader.CreateAsync(ParquetNetFilename)) {
                await reader.ReadEntireRowGroupAsync();
            }
        }
    }

    [Benchmark]
    public async Task ParquetSharp() {
        //https://github.com/G-Research/ParquetSharp#low-level-api

        if (Mode == "write") {
            using (var writer = new ParquetFileWriter(ParquetSharpFilename, _pqss, Compression.Uncompressed)) {
                using (RowGroupWriter rowGroup = writer.AppendRowGroup()) {
                    _pqsWriteAction(rowGroup);
                }
            }
        } else if(Mode == "read") {
            // intentionally reads the file created by Parquet.Net in Setup,
            // so both libraries read identical data
            using(var reader = new ParquetFileReader(ParquetNetFilename)) {
                using(var g = reader.RowGroup(0)) {
                    int n = checked((int) g.MetaData.NumRows);
                    _pqsReadAction(g, n);
                }
            }
        }

    }
}

Thanks for reading! You can always email me or use the contact form with questions, comments, etc.