Downloading UK Postcode Database (Open Postcode) into Spark

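The postcode data is listed on the data.gov.uk website, which links to a download page offering a zipped CSV. The code below automates ingesting that file into Spark. It assumes a notebook environment, such as Databricks, where the spark session and the display function are already defined.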

import requests
import zipfile
import io

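# Download the zipped CSV into memory and fail early on an HTTP error.
loc = "https://www.getthedata.com/downloads/open_postcode_geo.csv.zip"
response = requests.get(loc)
response.raise_for_status()
zip_bin_data = response.content

# Wrap the bytes in a file-like object so zipfile can read them without touching disk.
byte_file = io.BytesIO(zip_bin_data)

with zipfile.ZipFile(byte_file, "r") as zip_ref:
    print(zip_ref.namelist())
    # This assumes the CSV is the first entry in the archive.
    entry_name = zip_ref.namelist()[0]
    csv_bin_data = zip_ref.read(entry_name)

# Decode to text and split into lines, one CSV record per line.
csv_data = csv_bin_data.decode("utf-8")
lines = csv_data.splitlines()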


schema = (
    "postcode string, status string, usertype string, "
    "easting int, northing int, positional_quality_indicator int, "
    "country string, latitude decimal(25,20), longitude decimal(25,20), "
    "postcode_no_space string, postcode_fixed_width_seven string, "
    "postcode_fixed_width_eight string, postcode_area string, "
    "postcode_district string, postcode_sector string, "
    "outcode string, incode string"
)

# spark.read.csv also accepts an RDD of strings, one CSV row per element.
dt = spark.sparkContext.parallelize(lines)
df = spark.read.csv(dt, schema=schema)

display(df)
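
The code above takes whatever entry comes first in the archive. If the archive ever ships with extra files, selecting the entry by extension may be more robust. The following is only a sketch of that idea: the helper name read_csv_entry and the synthetic archive are hypothetical and are not part of the original download code.

```python
import io
import zipfile


def read_csv_entry(zip_bytes: bytes) -> bytes:
    """Return the raw bytes of the first .csv entry in an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zip_ref:
        # Pick entries by extension instead of relying on archive order.
        csv_names = [n for n in zip_ref.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no .csv entry found in archive")
        return zip_ref.read(csv_names[0])


# Build a tiny synthetic archive (made-up content) to try the helper offline.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("readme.txt", "not the data")
    zf.writestr("sample.csv", "a,b\n1,2\n")

sample_lines = read_csv_entry(buffer.getvalue()).decode("utf-8").splitlines()
print(sample_lines)
```

With the real download, the same helper would be called on the response bytes in place of buffer.getvalue().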

There are 2,581,934 records at the time of writing.
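
The record count and the column layout can also be checked on the plain list of lines before it reaches Spark. This is a minimal sketch, assuming every record should carry the 17 fields named in the schema string above. The function name summarize and the placeholder rows are hypothetical and are not real postcode data.

```python
import csv

EXPECTED_COLUMNS = 17  # fields listed in the schema string above


def summarize(lines):
    """Return (record_count, numbers of records whose field count is off)."""
    bad_records = []
    record_count = 0
    for number, row in enumerate(csv.reader(lines), start=1):
        record_count += 1
        if len(row) != EXPECTED_COLUMNS:
            bad_records.append(number)
    return record_count, bad_records


# Placeholder rows, not real postcode data: two well-formed, one short.
good_row = ",".join(["x"] * EXPECTED_COLUMNS)
short_row = ",".join(["x"] * (EXPECTED_COLUMNS - 1))
result = summarize([good_row, short_row, good_row])
print(result)
```

Calling summarize(lines) on the downloaded data gives a count to compare with the figure above. In Spark itself, df.count() returns the number of rows in the DataFrame.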

License

Note that the data comes with licensing terms that you need to follow. The provider, GetTheData, states:

Free to use for any purpose - attribution required.

Open Postcode Geo is derived from the ONS Postcode Directory which is licenced under the Open Government Licence and the Ordnance Survey OpenData Licence. Northern Irish postcodes have been removed as these are covered by a more restrictive licence. You may use the additional fields provided by GetTheData without restriction.

For details of the required attribution statements see the ONS Licences page.

