Load a CSV file into AWS Athena for SQL Analysis

This tutorial will teach you how to load a CSV file into AWS Athena so that you can analyze it using SQL queries. This process will work for any CSV file, but in this tutorial we'll be using the free version of our World Cities Database. You could also use it with any of our many geographic databases. Let's get started

  1. Login or create an AWS Account.

  2. Download the Basic World Cities Database.
    This is free and won't require you to provide any information. However, attribution is required. After downloading, unzip the folder. We'll be using the worldcities.csv file.

  3. Clean and format the CSV as needed. The CSV file should be encoded as UTF-8. Due to the way Athena parses CSV files, we also need to remove all quotes from the CSV file. The easiest way to do this is to open the CSV file in LibreOffice: and then save it. Note the absence of quotation marks. If you just want the CSV file after this has been done, you can download it here. Note: Do not try to do this with Excel because it does not handle UTF-8 characters well.

  4. Create a new AWS S3 Bucket. We'll use the name simplemaps-example-athena. You will need to use a unique name of your own. Use all of the default settings.

  5. Create two folders in the bucket: input and queries. Athena will use the queries folder to store the queries it runs.

  6. Upload the worldcities.csv file to the input folder you just created.

  7. If you did things correctly, your bucket should look like this: and the input folder should look like this:

  8. Open Amazon Athena and click Explore the Query Editor.

  9. Go to the Settings tab and set the location of the query result to be s3://simplemaps-example-athena/queries. Except, of course, use your own unique bucket name. This is where Athena will store the queries that it runs.

  10. It's possible to import the data using the "Create" button, but we will import the data using a Query. Paste the following into the query area:

    CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`worldcities` (
      `city` string,
      `city_ascii` string,
      `lat` double,
      `lng` double,
      `country` string,
      `iso2` string,
      `iso3` string,
      `admin_name` string,
      `capital` string,
      `population` int,
      `id` int
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
     'serialization.format' = ',',
     'field.delim' = ','
    )
    LOCATION 's3://simplemaps-example-athena/input/'
    TBLPROPERTIES ('skip.header.line.count'='1')
    ;
    
    then click Run:

    A few things to note on what this does:

    • We have defined all of the fields in the dataset and given them an appropriate type.
    • We've told Athena to use the LazySimpleSerDe CSV parser. We use this parser because it allows Null values for numbers. It does not support quoted values, though, which is why we removed those earlier with LibreOffice. You can learn more about parsers in this helpful guide.
    • We've told the parser that fields are delimited by commas and that the first row contains fieldnames which can be skipped.
    • We've specified the location of the CSV file. We only need to provide the folder, not the file itself.
  11. If the import was successful, you'll see a green "Completed" message:
  12. Test the data with the following query:
    select city, population from worldcities where iso2='US' order by population desc limit 5;
    

    and you should get the five largest cities in the United States:

  13. Now you can query your CSV file using any other SQL queries you like!
  14. To learn how to run these queries over a url, see our next tutorial Query an Athena database over a URL with AWS Lambda.

Home (Maps) | Database License | Privacy | Database Releases | Database Order Lookup | Resources | Contact | FAQs
SimpleMaps.com is a product of Pareto Software, LLC. © 2010-2024.