Parquet.Net 4.1.3 vs ParquetSharp 8.0.0 Performance

The official ParquetSharp docs claim that it is about 8 times faster than Parquet.Net. That is a believable claim, since ParquetSharp is a .NET wrapper around the native C++ implementation, but I wanted to verify it myself.

Benchmarking was done properly (code below): all data was prepared before the benchmarks ran, and both libraries were given an identical job, namely reading and writing files with different data types.

Tested on a Windows 11 machine (Surface Book 3, i7). Here are the raw

Results

| Method | DataType | DataSize | Mode | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| ParquetNet | float | 10 | read | 73.88 μs | 15.766 μs | 0.864 μs | 2.8076 | - | - | 11.49 KB |
| ParquetSharp | float | 10 | read | 100.49 μs | 20.178 μs | 1.106 μs | 9.0332 | 1.7090 | - | 37.23 KB |
| ParquetNet | float | 10 | write | 396.30 μs | 126.413 μs | 6.929 μs | 2.4414 | - | - | 11.63 KB |
| ParquetSharp | float | 10 | write | 491.33 μs | 259.184 μs | 14.207 μs | 4.8828 | - | - | 20.63 KB |
| ParquetNet | float | 100 | read | 75.12 μs | 17.052 μs | 0.935 μs | 2.9297 | - | - | 12.19 KB |
| ParquetSharp | float | 100 | read | 100.42 μs | 4.586 μs | 0.251 μs | 9.1553 | 1.8311 | - | 37.58 KB |
| ParquetNet | float | 100 | write | 534.55 μs | 566.419 μs | 31.047 μs | 3.4180 | - | - | 14 KB |
| ParquetSharp | float | 100 | write | 624.27 μs | 542.110 μs | 29.715 μs | 4.8828 | - | - | 20.63 KB |
| ParquetNet | float | 1000 | read | 173.66 μs | 14.614 μs | 0.801 μs | 5.6152 | - | - | 22.15 KB |
| ParquetSharp | float | 1000 | read | 101.96 μs | 9.078 μs | 0.498 μs | 10.0098 | 1.9531 | - | 41.09 KB |
| ParquetNet | float | 1000 | write | 576.80 μs | 70.744 μs | 3.878 μs | 9.7656 | - | - | 40.15 KB |
| ParquetSharp | float | 1000 | write | 592.69 μs | 562.047 μs | 30.808 μs | 4.8828 | - | - | 20.63 KB |
| ParquetNet | float | 1000000 | read | 12,020.33 μs | 23,014.402 μs | 1,261.497 μs | 156.2500 | 156.2500 | 156.2500 | 7827.37 KB |
| ParquetSharp | float | 1000000 | read | 2,263.07 μs | 935.326 μs | 51.268 μs | 125.0000 | 117.1875 | 117.1875 | 3944.02 KB |
| ParquetNet | float | 1000000 | write | 40,940.85 μs | 26,413.129 μs | 1,447.793 μs | 692.3077 | 692.3077 | 692.3077 | 30277.95 KB |
| ParquetSharp | float | 1000000 | write | 45,095.70 μs | 135,818.351 μs | 7,444.662 μs | - | - | - | 20.75 KB |
| ParquetNet | int | 10 | read | 90.33 μs | 163.148 μs | 8.943 μs | 2.6855 | - | - | 11.49 KB |
| ParquetSharp | int | 10 | read | 101.88 μs | 22.113 μs | 1.212 μs | 9.1553 | 1.8311 | - | 37.6 KB |
| ParquetNet | int | 10 | write | 525.79 μs | 65.553 μs | 3.593 μs | 2.4414 | - | - | 11.63 KB |
| ParquetSharp | int | 10 | write | 553.13 μs | 86.676 μs | 4.751 μs | 4.8828 | - | - | 20.88 KB |
| ParquetNet | int | 100 | read | 75.46 μs | 9.007 μs | 0.494 μs | 2.9297 | - | - | 12.19 KB |
| ParquetSharp | int | 100 | read | 112.53 μs | 7.467 μs | 0.409 μs | 9.1553 | 1.8311 | - | 37.95 KB |
| ParquetNet | int | 100 | write | 539.13 μs | 894.439 μs | 49.027 μs | 3.4180 | - | - | 14 KB |
| ParquetSharp | int | 100 | write | 640.44 μs | 203.358 μs | 11.147 μs | 4.8828 | - | - | 20.88 KB |
| ParquetNet | int | 1000 | read | 175.95 μs | 30.184 μs | 1.655 μs | 5.6152 | - | - | 22.15 KB |
| ParquetSharp | int | 1000 | read | 106.26 μs | 16.655 μs | 0.913 μs | 10.1318 | 1.9531 | - | 41.47 KB |
| ParquetNet | int | 1000 | write | 600.46 μs | 203.480 μs | 11.153 μs | 9.7656 | - | - | 40.15 KB |
| ParquetSharp | int | 1000 | write | 747.07 μs | 300.856 μs | 16.491 μs | 4.8828 | - | - | 20.88 KB |
| ParquetNet | int | 1000000 | read | 11,774.15 μs | 7,818.750 μs | 428.572 μs | 156.2500 | 156.2500 | 156.2500 | 7827.9 KB |
| ParquetSharp | int | 1000000 | read | 2,863.95 μs | 559.571 μs | 30.672 μs | 117.1875 | 109.3750 | 109.3750 | 3944.37 KB |
| ParquetNet | int | 1000000 | write | 37,166.73 μs | 17,563.943 μs | 962.739 μs | 714.2857 | 714.2857 | 714.2857 | 30277.97 KB |
| ParquetSharp | int | 1000000 | write | 24,173.06 μs | 13,369.745 μs | 732.841 μs | - | - | - | 20.94 KB |
| ParquetNet | str | 10 | read | 82.11 μs | 40.593 μs | 2.225 μs | 3.9063 | - | - | 16.28 KB |
| ParquetSharp | str | 10 | read | 122.56 μs | 22.451 μs | 1.231 μs | 26.8555 | 6.5918 | - | 112.05 KB |
| ParquetNet | str | 10 | write | 585.58 μs | 153.079 μs | 8.391 μs | 3.9063 | - | - | 17.11 KB |
| ParquetSharp | str | 10 | write | 619.85 μs | 421.429 μs | 23.100 μs | 19.5313 | 3.9063 | - | 81.14 KB |
| ParquetNet | str | 100 | read | 197.51 μs | 102.508 μs | 5.619 μs | 13.9160 | - | - | 50.01 KB |
| ParquetSharp | str | 100 | read | 130.14 μs | 20.777 μs | 1.139 μs | 32.2266 | 10.4980 | - | 132.45 KB |
| ParquetNet | str | 100 | write | 2,023.54 μs | 2,064.824 μs | 113.180 μs | 13.6719 | - | - | 56.64 KB |
| ParquetSharp | str | 100 | write | 795.58 μs | 2,056.523 μs | 112.725 μs | 20.5078 | 3.9063 | - | 87.21 KB |
| ParquetNet | str | 1000 | read | 514.14 μs | 537.780 μs | 29.478 μs | 64.4531 | 32.2266 | 32.2266 | 355.95 KB |
| ParquetSharp | str | 1000 | read | 236.97 μs | 298.893 μs | 16.383 μs | 71.2891 | 23.4375 | - | 336.35 KB |
| ParquetNet | str | 1000 | write | 1,378.53 μs | 1,838.067 μs | 100.751 μs | 72.2656 | 72.2656 | 72.2656 | 395.21 KB |
| ParquetSharp | str | 1000 | write | 1,005.94 μs | 874.443 μs | 47.931 μs | 48.8281 | 9.7656 | - | 206.29 KB |
| ParquetNet | str | 1000000 | read | 373,941.10 μs | 176,271.747 μs | 9,662.049 μs | 36000.0000 | 18000.0000 | 1000.0000 | 339864.13 KB |
| ParquetSharp | str | 1000000 | read | 349,504.53 μs | 192,279.577 μs | 10,539.492 μs | 36000.0000 | 18000.0000 | 1000.0000 | 226691.53 KB |
| ParquetNet | str | 1000000 | write | 623,253.27 μs | 369,838.885 μs | 20,272.117 μs | - | - | - | 390525.34 KB |
| ParquetSharp | str | 1000000 | write | 262,019.10 μs | 247,550.530 μs | 13,569.080 μs | 11500.0000 | 1000.0000 | - | 110924.43 KB |

Summary

On smaller datasets Parquet.Net is actually 10-30% faster, which suggests its managed code path is well optimised for light workloads. On the large dataset, i.e. 1 million rows, Parquet.Net's results are:

  • integer type: reading is about 4 times slower than ParquetSharp; writing is about 1.5 times slower.
  • float type: reading is about 5 times slower; writing is almost identical.
  • string type: reading is almost identical; writing is about 2.4 times slower.
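
The large-dataset ratios follow directly from the mean timings in the results table. A quick sanity check (values copied from the 1,000,000-row rows; Python used here purely for the arithmetic):

```python
# Mean timings in μs for the 1,000,000-row cases, taken from the results table.
# Each entry is (Parquet.Net mean, ParquetSharp mean).
means = {
    ("int",   "read"):  (11_774.15, 2_863.95),
    ("int",   "write"): (37_166.73, 24_173.06),
    ("float", "read"):  (12_020.33, 2_263.07),
    ("float", "write"): (40_940.85, 45_095.70),
    ("str",   "read"):  (373_941.10, 349_504.53),
    ("str",   "write"): (623_253.27, 262_019.10),
}

for (dtype, mode), (net, sharp) in means.items():
    # Ratio > 1 means Parquet.Net is slower by that factor
    print(f"{dtype:5} {mode:5}: {net / sharp:.1f}x")
```

This prints roughly 4.1x and 1.5x for int read/write, 5.3x for float read, and 2.4x for string write, matching the summary above.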

So yes, Parquet.Net 4.1.3 is slower, but not 8 times slower. In return you get safe, fully managed .NET code that runs on every platform, unlike a native wrapper.

But I'd watch this space: Parquet.Net 4.2 will start bringing performance improvements that may put it on par, or even make it faster. The main reason it is slow today is its reliance on BinaryWriter and BinaryReader for low-level binary encoding, and those classes are not optimised for heavy workloads. Early tests show that eliminating them makes Parquet.Net faster than ParquetSharp, so these improvements may land soon.
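
The cost of encoding one value at a time is not specific to .NET. As a rough analogy (a Python sketch, not the actual Parquet.Net internals), compare encoding a column element by element, the way BinaryWriter.Write is invoked per value, with a single bulk buffer conversion:

```python
import struct
from array import array

values = list(range(100_000))

def encode_per_value(vals):
    """Encode one int at a time: many small pack calls and buffer growths,
    analogous to calling BinaryWriter.Write(int) per element."""
    out = bytearray()
    for v in vals:
        out += struct.pack("=i", v)  # native byte order, 4-byte int
    return bytes(out)

def encode_bulk(vals):
    """Encode the whole column in one call, analogous to writing one
    contiguous buffer for the entire data page."""
    return array("i", vals).tobytes()

# Both produce identical bytes; the bulk version skips ~100,000 tiny
# allocations and per-call overhead, which is where the time goes.
assert encode_per_value(values) == encode_bulk(values)
```

Timing the two functions (e.g. with timeit) shows the bulk path is many times faster, which is the same effect the planned Parquet.Net 4.2 changes are chasing.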

Benchmarking Code

#LINQPad optimize+

// NuGet packages used: Parquet.Net 4.1.3, ParquetSharp 8.0.0, BenchmarkDotNet

void Main()
{
	Util.AutoScrollResults = true;
	
	// generate int and string samples
	
	BenchmarkRunner.Run<ParquetBenches>();
}

[ShortRunJob]
[MarkdownExporter]
[MemoryDiagnoser]
public class ParquetBenches {
    public string ParquetNetFilename;
    public string ParquetSharpFilename;

    [Params("int", "str", "float")]
    public string DataType;

    [Params(10, 100, 1000, 1000000)]
    //[Params(10)]
    public int DataSize;
    
    [Params("write", "read")]
    public string Mode;
    

    private static Random random = new Random();
    public static string RandomString(int length) {
        const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        return new string(Enumerable.Repeat(chars, length)
          .Select(s => s[random.Next(s.Length)]).ToArray());
    }

    private DataColumn _pqnc;
    private Schema _pqns;
    private MemoryStream _pnqMs;

    private Column[] _pqss;
    private object _pqsd;
    private Action<RowGroupWriter> _pqsWriteAction;
    private Action<RowGroupReader, int> _pqsReadAction;

    [GlobalSetup]
    public async Task Setup() {
        switch (DataType) {
            case "int":
                _pqnc = new DataColumn(new DataField<int>("c"), Enumerable.Range(0, DataSize).ToArray());
                _pqss = new Column[] { new Column<int>("c") };
                _pqsd = (int[])_pqnc.Data;
                _pqsWriteAction = w => {
                    using (var colWriter = w.NextColumn().LogicalWriter<int>()) {
                        colWriter.WriteBatch((int[])_pqsd);
                    }
                };
                _pqsReadAction = (r, n) => {
                    var data = r.Column(0).LogicalReader<int>().ReadAll(n);
                };
                break;
            case "str":
                _pqnc = new DataColumn(new DataField<string>("c"), Enumerable.Range(0, DataSize).Select(i => RandomString(100)).ToArray());
                _pqss = new Column[] { new Column<string>("c") };
                _pqsd = (string[])_pqnc.Data;
                _pqsWriteAction = w => {
                    using (var colWriter = w.NextColumn().LogicalWriter<string>()) {
                        colWriter.WriteBatch((string[])_pqsd);
                    }
                };
                _pqsReadAction = (r, n) => {
                    var data = r.Column(0).LogicalReader<string>().ReadAll(n);
                };
                break;
            case "float":
                _pqnc = new DataColumn(
                    new DataField<float>("f"), Enumerable.Range(0, DataSize).Select(i => (float)i).ToArray());
                _pqss = new Column[] { new Column<float>("f") };
                _pqsd = (float[])_pqnc.Data;
                _pqsWriteAction = w => {
                    using (var colWriter = w.NextColumn().LogicalWriter<float>()) {
                        colWriter.WriteBatch((float[])_pqsd);
                    }
                };
                _pqsReadAction = (r, n) => {
                    var data = r.Column(0).LogicalReader<float>().ReadAll(n);
                };

                break;
            case "date":
                _pqnc = new DataColumn(
                    new DataField<DateTimeOffset>("dto"),
                    Enumerable.Range(0, DataSize).Select(i => (DateTimeOffset)DateTime.UtcNow.AddSeconds(i)).ToArray());
                _pqss = new Column[] { new Column<DateTimeOffset>("dto") };
                _pqsd = (DateTimeOffset[])_pqnc.Data;
                _pqsWriteAction = w => {
                    using (var colWriter = w.NextColumn().LogicalWriter<DateTimeOffset>()) {
                        colWriter.WriteBatch((DateTimeOffset[])_pqsd);
                    }
                };
                _pqsReadAction = (r, n) => {
                    var data = r.Column(0).LogicalReader<DateTimeOffset>().ReadAll(n);
                };

                break;


            default:
                throw new NotImplementedException();
        }

        _pqns = new Schema(_pqnc.Field);
        _pnqMs = new MemoryStream(1000);

        ParquetNetFilename = $"c:\\tmp\\parq_net_benchmark_{Mode}_{DataSize}_{DataType}.parquet";
        ParquetSharpFilename = $"c:\\tmp\\parq_sharp_benchmark_{Mode}_{DataSize}_{DataType}.parquet";

        if (Mode == "read") {
            using (Stream fileStream = File.Create(ParquetNetFilename)) {
                using (var writer = await ParquetWriter.CreateAsync(_pqns, fileStream)) {
                    writer.CompressionMethod = CompressionMethod.None;
                    // create a new row group in the file
                    using (ParquetRowGroupWriter groupWriter = writer.CreateRowGroup()) {
                        await groupWriter.WriteColumnAsync(_pqnc);
                    }
                }
            }
        }
    }

    [Benchmark]
    public async Task ParquetNet() {
        if (Mode == "write") {
            using (Stream fileStream = File.Create(ParquetNetFilename)) {
                using (var writer = await ParquetWriter.CreateAsync(_pqns, fileStream)) {
                    writer.CompressionMethod = CompressionMethod.None;
                    // create a new row group in the file
                    using (ParquetRowGroupWriter groupWriter = writer.CreateRowGroup()) {
                        await groupWriter.WriteColumnAsync(_pqnc);
                    }
                }
            }
        } else if(Mode == "read") {
            using(var reader = await ParquetReader.CreateAsync(ParquetNetFilename)) {
                await reader.ReadEntireRowGroupAsync();
            }
        }
    }

    [Benchmark]
    public async Task ParquetSharp() {
        //https://github.com/G-Research/ParquetSharp#low-level-api

        if (Mode == "write") {
            using (var writer = new ParquetFileWriter(ParquetSharpFilename, _pqss, Compression.Uncompressed)) {
                using (RowGroupWriter rowGroup = writer.AppendRowGroup()) {
                    _pqsWriteAction(rowGroup);
                }
            }
        } else if(Mode == "read") {
            // both benchmarks read the same file (written by Parquet.Net in Setup)
            // so that they do identical work
            using(var reader = new ParquetFileReader(ParquetNetFilename)) {
                using(var g = reader.RowGroup(0)) {
                    int n = checked((int) g.MetaData.NumRows);
                    _pqsReadAction(g, n);
                }
            }
        }

    }
}


To contact me, send an email anytime or leave a comment below.