1. Code Writing

Objective: Write a Python script for data analysis.

Prompt Example: 

  1. Use Pandas to load a CSV file.

  2. Perform basic data cleaning (e.g., handling missing values, renaming columns).

  3. Generate a statistical summary of the numerical columns.

  4. Create a data visualisation using Matplotlib.

  5. Audience: University students and staff with a basic understanding of Python.

  6. Expected Outcome: A clear and functional Python script that can be used as a template for data analysis tasks.

 

Example User Query:  

Generate a Python script that performs the following tasks: 

1. Load a CSV file using Pandas. 

2. Clean the data by handling missing values and renaming columns. 

3. Generate a statistical summary for numerical columns. 

4. Create a data visualisation of the data using Matplotlib. 

The audience is university students and staff with a basic understanding of Python. The outcome should be a clear and functional Python script that can be used as a template for data analysis tasks.

 

ELM’s Response:  

Below is a Python script designed to load a CSV file using Pandas, clean the data by handling missing values and renaming columns, generate a statistical summary for numerical columns, and create a data visualisation using Matplotlib. This script can serve as a template for basic data analysis tasks. 

# Import necessary libraries 

import pandas as pd 

import matplotlib.pyplot as plt 

 

# Step 1: Load a CSV file using Pandas 

def load_csv(file_path): 

    data = pd.read_csv(file_path) 

    return data 

 

# Step 2: Clean the data by handling missing values and renaming columns 

def clean_data(df): 

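    # Handling missing values: fill numeric columns with their column mean

    # (numeric_only=True skips text columns, which raise an error in recent pandas versions)

    df.fillna(df.mean(numeric_only=True), inplace=True)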

     

    # Renaming columns for better readability 

    df.rename(columns={ 

        'col1': 'Column1',  

        'col2': 'Column2',  

        'col3': 'Column3' 

    }, inplace=True) 

     

    return df 

 

# Step 3: Generate a statistical summary for numerical columns 

def generate_summary(df): 

    summary = df.describe() 

    print("Statistical Summary:") 

    print(summary) 

     

# Step 4: Create a data visualisation of the data using Matplotlib 

def create_visualisation(df): 

    plt.figure(figsize=(10, 6)) 

    # Example: Plotting a histogram of 'Column1' 

    plt.hist(df['Column1'], bins=30, edgecolor='k', alpha=0.7) 

    plt.title('Histogram of Column1') 

    plt.xlabel('Column1') 

    plt.ylabel('Frequency') 

    plt.show() 

# Main function to execute the steps 

def main(file_path): 

    # Load the data 

    data = load_csv(file_path) 

    # Clean the data 

    data_cleaned = clean_data(data) 

    # Generate and print the statistical summary 

    generate_summary(data_cleaned) 

    # Create a data visualisation 

    create_visualisation(data_cleaned) 

# Specify the path to your CSV file 

csv_file_path = 'path/to/your/data.csv' 

if __name__ == "__main__": 

    main(csv_file_path) 

 

Instructions: 

  1. Prepare your environment: Ensure you have Pandas and Matplotlib installed. If you do not have them yet, you can install them using pip:

pip install pandas matplotlib

  2. Save the script: Copy the script above into a Python file, for example data_analysis_template.py.

  3. Run the script: Set csv_file_path to the path of the CSV file you want to analyse, then execute the script from the command line or your preferred Integrated Development Environment (IDE).
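The script above reads its input path from the csv_file_path variable. If you would rather supply the path when you run the script, one possible approach is a small wrapper around Python's standard argparse module. The sketch below is an illustration and not part of ELM's response; parse_args and csv_path are hypothetical names, and the default value simply repeats the placeholder path used in the script.

```python
import argparse


def parse_args(argv=None):
    """Read the CSV path from the command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Run the data analysis template on a CSV file."
    )
    # The path is optional; if it is omitted, the placeholder from the
    # template is used so the script behaves as before.
    parser.add_argument(
        "csv_path",
        nargs="?",
        default="path/to/your/data.csv",
        help="Path to the CSV file to analyse",
    )
    return parser.parse_args(argv)


# In the template, the final block could then become:
#
# if __name__ == "__main__":
#     main(parse_args().csv_path)
```

With this change the script could be run as python data_analysis_template.py students.csv, where students.csv stands in for your own file.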

 

Notes: 

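  • Column Renaming: Modify the df.rename call so that its keys match the actual column names in your CSV file. By default, names that do not exist in the file are silently ignored, so check the result before plotting.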

  • Handling Missing Values: The example fills missing values in numerical columns with the mean of the respective column. Depending on your dataset, you might choose a different strategy, such as the median, the mode, or dropping rows that contain missing values.

  • Visualisation: The histogram example is created for Column1. You can customise this to create other types of plots and visualise different columns based on the requirements of your analysis. 
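The alternative missing-value strategies mentioned in the notes can be sketched as follows. This is a minimal illustration, not part of ELM's response: fill_missing, its strategy argument, and the small sample table are all made up for the example, and the function returns a new DataFrame rather than modifying the original.

```python
import pandas as pd


def fill_missing(df, strategy="median"):
    """Return a copy of df with missing values handled (hypothetical helper)."""
    if strategy == "drop":
        # Remove every row that contains at least one missing value
        return df.dropna()

    result = df.copy()
    if strategy == "median":
        # The median only makes sense for numeric columns
        numeric_cols = result.select_dtypes(include="number").columns
        result[numeric_cols] = result[numeric_cols].fillna(
            result[numeric_cols].median()
        )
    elif strategy == "mode":
        # mode() can return several rows when values tie; use the first row
        result = result.fillna(result.mode().iloc[0])
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return result


# Made-up data for demonstration only
sample = pd.DataFrame(
    {"score": [1.0, None, 3.0, 100.0], "group": ["a", "a", None, "b"]}
)
print(fill_missing(sample, "median"))
```

The median is less sensitive to extreme values than the mean (note the 100.0 in the sample), while the mode also works for text columns such as group. Dropping rows is the simplest option but discards data.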

This template provides a foundation for data analysis and can be expanded with additional data cleaning, analysis, and visualisation steps as needed for more complex datasets.
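
As one example of such an expansion, a scatter plot of two numerical columns could be added alongside the histogram. The sketch below is an assumption about how that might look, not part of ELM's response; create_scatter is a hypothetical function, and Column1 and Column2 are the placeholder names from the renaming step.

```python
import matplotlib.pyplot as plt
import pandas as pd


def create_scatter(df, x_col, y_col):
    """Draw a scatter plot of two columns and return the Axes (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.scatter(df[x_col], df[y_col], alpha=0.7)
    ax.set_title(f"{y_col} vs {x_col}")
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    return ax


# Made-up data for demonstration only
demo = pd.DataFrame({"Column1": [1, 2, 3, 4], "Column2": [2.0, 4.1, 5.9, 8.2]})
ax = create_scatter(demo, "Column1", "Column2")
# Call plt.show() afterwards to display the figure
```

Returning the Axes object instead of calling plt.show() inside the function leaves it to the caller to decide whether to display the figure, save it, or add further elements first.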