Objective: Write a Python script for data analysis.

Prompt Example: Use Pandas to load a CSV file. Perform basic data cleaning (e.g., handling missing values, renaming columns). Generate a summary of statistical data for numerical columns. Create a data visualisation using Matplotlib.

Audience: University students and staff with a basic understanding of Python.

Expected Outcome: A clear and functional Python script that can be used as a template for data analysis tasks.

Example User Query:

Generate a Python script that performs the following tasks:
1. Load a CSV file using Pandas.
2. Clean the data by handling missing values and renaming columns.
3. Generate a statistical summary for numerical columns.
4. Create a data visualisation of the data using Matplotlib.
The audience is university students and staff with a basic understanding of Python. The outcome should be a clear and functional Python script that can be used as a template for data analysis tasks.

ELM’s Response:

Below is a Python script designed to load a CSV file using Pandas, clean the data by handling missing values and renaming columns, generate a statistical summary for numerical columns, and create a data visualisation using Matplotlib. This script can serve as a template for basic data analysis tasks.
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt


# Step 1: Load a CSV file using Pandas
def load_csv(file_path):
    data = pd.read_csv(file_path)
    return data


# Step 2: Clean the data by handling missing values and renaming columns
def clean_data(df):
    # Handling missing values: fill numerical columns with the column mean.
    # numeric_only=True skips text columns, which would otherwise raise an
    # error in recent versions of Pandas.
    df.fillna(df.mean(numeric_only=True), inplace=True)
    # Renaming columns for better readability
    df.rename(columns={
        'col1': 'Column1',
        'col2': 'Column2',
        'col3': 'Column3'
    }, inplace=True)
    return df


# Step 3: Generate a statistical summary for numerical columns
def generate_summary(df):
    summary = df.describe()
    print("Statistical Summary:")
    print(summary)


# Step 4: Create a data visualisation of the data using Matplotlib
def create_visualisation(df):
    plt.figure(figsize=(10, 6))
    # Example: Plotting a histogram of 'Column1'
    plt.hist(df['Column1'], bins=30, edgecolor='k', alpha=0.7)
    plt.title('Histogram of Column1')
    plt.xlabel('Column1')
    plt.ylabel('Frequency')
    plt.show()


# Main function to execute the steps
def main(file_path):
    # Load the data
    data = load_csv(file_path)
    # Clean the data
    data_cleaned = clean_data(data)
    # Generate and print the statistical summary
    generate_summary(data_cleaned)
    # Create a data visualisation
    create_visualisation(data_cleaned)


# Specify the path to your CSV file
csv_file_path = 'path/to/your/data.csv'

if __name__ == "__main__":
    main(csv_file_path)

Instructions:

Prepare your environment: Ensure you have Pandas and Matplotlib installed. If you don't have them yet, you can install them using pip:

pip install pandas matplotlib

Save the script: Copy and save the above script into a Python file, for example, data_analysis_template.py.

Run the script: Set csv_file_path in the script to the path of the CSV file you want to analyse, then execute the script from the command line or your preferred Integrated Development Environment (IDE).

Notes:

Column Renaming: Modify the df.rename function call to match the actual columns in your CSV file. The histogram in create_visualisation uses the renamed 'Column1', so update that name as well.
Handling Missing Values: The example fills missing values in numerical columns with the mean of the respective column. Depending on your dataset, you might choose a different strategy, such as filling with the median or mode, or dropping rows that contain missing values.

Visualisation: The histogram example is created for Column1. You can customise this to create other types of plots and visualise different columns based on the requirements of your analysis.

This template provides a foundation for data analysis and can be expanded with additional data cleaning, analysis, and visualisation steps as needed for more complex datasets.

This article was published on 2024-10-08
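
The alternative missing-value strategies mentioned in the notes (median, mode, or dropping rows) could be sketched as follows. This is a minimal illustration and not part of the generated script: the sample data, the column names score and group, and the three helper function names are all hypothetical and would need adapting to your own dataset.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({
    "score": [2.0, None, 4.0, 4.0, 90.0],
    "group": ["a", "b", None, "b", "b"],
})


def fill_numeric_with_median(frame):
    """Fill gaps in numerical columns with each column's median."""
    # The median is less sensitive to outliers (such as 90.0) than the mean.
    return frame.fillna(frame.median(numeric_only=True))


def fill_with_mode(frame):
    """Fill gaps in every column with that column's most frequent value."""
    # mode() can return several rows when values tie; iloc[0] takes the first.
    return frame.fillna(frame.mode().iloc[0])


def drop_incomplete_rows(frame):
    """Remove every row that contains at least one missing value."""
    return frame.dropna()


median_filled = fill_numeric_with_median(df)
mode_filled = fill_with_mode(df)
complete_rows = drop_incomplete_rows(df)

print(median_filled)
print(mode_filled)
print(complete_rows)
```

Each helper returns a new DataFrame and leaves the original untouched, which makes it easy to compare strategies side by side before choosing one to use in clean_data. Note that the median only applies to numerical columns, so the text column keeps its gap in that case.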