Add notebook comparing Pandas and Polars API examples #20

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status: Open · wants to merge 1 commit into `master`
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.11.11","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":1138586,"sourceType":"datasetVersion","datasetId":4609}],"dockerImageVersionId":31012,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Pandas vs Polars API Comparison: Code Examples","metadata":{}},{"cell_type":"markdown","source":"This notebook provides the Python code examples discussed in the accompanying article comparing the APIs of the Pandas and Polars DataFrame libraries. It demonstrates syntax for common data manipulation tasks and includes the illustrative benchmark code.\n\nMake sure you have both libraries installed:","metadata":{}},{"cell_type":"code","source":"pip install pandas polars","metadata":{"trusted":true},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"Let's start with the necessary imports.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nimport polars as pl\nimport time\nimport gc # Garbage collector\nimport numpy as np # For introducing nulls in Pandas example\n\nprint(f\"Pandas version: {pd.__version__}\")\nprint(f\"Polars version: {pl.__version__}\")","metadata":{"trusted":true},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"Now, let's compare the Pandas API and Polars API side-by-side for essential data manipulation tasks. We'll use illustrative code snippets, assuming you have imported the libraries as pd and pl respectively. \n\nNote: The following code placeholders assume DataFrames like `df_pandas`, `df_polars`, `df_left_pandas`, `df_right_polars`, etc., have been defined. You will need to define these yourself based on the context or previous examples if you wish to run these cells directly.","metadata":{}},{"cell_type":"markdown","source":"### Reading/Writing Data (CSV Example) \n\nLoading data from files and writing results back is a fundamental step.\n\n**Pandas**\n\nUses straightforward functions like `read_csv()`. Execution is eager.","metadata":{}},{"cell_type":"code","source":"# Reading (Eager)\ndf_pandas = pd.read_csv(\"data.csv\")\n\n# Writing\ndf_pandas.to_csv(\"output_pandas.csv\", index=False) # Must often disable index writing","metadata":{"trusted":true},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"**Polars**\n\nOffers both eager `read_csv()` and lazy `scan_csv()` options. Lazy scanning is highly recommended for large files as it allows optimizations before loading data into memory.","metadata":{}},{"cell_type":"code","source":"# Reading (Eager)\ndf_polars = pl.read_csv(\"data.csv\")\n\n# Reading (Lazy - Preferred for large files)\nlf_polars = pl.scan_csv(\"data.csv\")\n# ... 
### Reading/Writing Data (CSV Example)

Loading data from files and writing results back is a fundamental step.

**Pandas**

Uses straightforward functions like `read_csv()`. Execution is eager.

```python
# Reading (eager)
df_pandas = pd.read_csv("data.csv")

# Writing
df_pandas.to_csv("output_pandas.csv", index=False)  # Often necessary to disable index writing
```

**Polars**

Offers both an eager `read_csv()` and a lazy `scan_csv()`. Lazy scanning is highly recommended for large files because it allows optimizations to run before any data is loaded into memory.

```python
# Reading (eager)
df_polars = pl.read_csv("data.csv")

# Reading (lazy - preferred for large files)
lf_polars = pl.scan_csv("data.csv")
# ... other lazy operations on lf_polars ...
df_polars = lf_polars.collect()  # Execute the plan and load the data

# Writing
df_polars.write_csv("output_polars.csv")
# LazyFrames can also sink directly to disk
# lf_polars.sink_csv("output_lazy.csv")
```
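One way to see those optimizations at work is `LazyFrame.explain()`, which prints the optimized query plan without executing it. A brief sketch, using the same placeholder file and column names as above:

```python
import polars as pl

# Build a lazy query: only two columns and a filtered subset are requested,
# so Polars can push the projection and predicate down into the CSV scan
# instead of reading the whole file first.
lf = (
    pl.scan_csv("data.csv")
    .filter(pl.col("A") > 100)
    .select(["A", "B"])
)

# Inspect the optimized plan before running anything
print(lf.explain())

# Execute only when the plan is final
df = lf.collect()
```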
### Selection (Rows & Columns)

Selecting specific subsets of your data is a core operation.

**Pandas**

Offers flexible selection using `[]`, label-based `.loc[]`, and integer-position-based `.iloc[]`.

```python
# Select single column 'A'
col_a_pandas = df_pandas['A']

# Select multiple columns
subset_cols_pandas = df_pandas[['A', 'B']]

# Select rows by integer position (slicing)
subset_rows_pandas = df_pandas.iloc[5:10]

# Select rows and columns by label/position
subset_loc_pandas = df_pandas.loc[df_pandas['index_col'] == 'label', ['A', 'B']]  # Using .loc
subset_iloc_pandas = df_pandas.iloc[5:10, [0, 2]]  # Using .iloc
```

**Polars**

Uses `select()` for columns, `filter()` for row conditions (see the next section), and `[]` for row slicing or selecting columns by name.

```python
# Select single column 'A' (returns a Series)
col_a_polars = df_polars['A']
# More explicit way using select (returns a DataFrame)
col_a_df_polars = df_polars.select(pl.col('A'))

# Select multiple columns
subset_cols_polars = df_polars.select(['A', 'B'])
# Using expressions
subset_cols_expr_polars = df_polars.select(pl.col('A'), pl.col('B'))

# Select rows by integer position (slicing)
subset_rows_polars = df_polars[5:10]

# Select rows and columns (usually involves chaining filter/select or slicing)
subset_filter_select_polars = df_polars.filter(pl.col('C') > 10).select(['A', 'B'])
subset_slice_select_polars = df_polars[5:10].select(['A', 'B'])
```

### Filtering Data

Selecting rows based on conditions is essential for analysis.

**Pandas**

Commonly uses boolean masking or the `.query()` method.

```python
# Boolean masking
filtered_pandas = df_pandas[df_pandas['A'] > 100]
filtered_multi_pandas = df_pandas[(df_pandas['A'] > 100) & (df_pandas['B'] == 'category1')]

# Using .query()
filtered_query_pandas = df_pandas.query("A > 100 and B == 'category1'")
```

**Polars**

Uses the `filter()` method with expressions.

```python
# Using filter() with expressions
filtered_polars = df_polars.filter(pl.col('A') > 100)
filtered_multi_polars = df_polars.filter(
    (pl.col('A') > 100) & (pl.col('B') == 'category1')
)
```

### Creating/Modifying Columns

Adding new columns or changing existing ones based on calculations.

**Pandas**

Direct assignment (`df['new_col'] = ...`) is common; `.assign()` provides a method-chaining alternative.

```python
# Direct assignment
df_pandas['C'] = df_pandas['A'] * 10
df_pandas['D'] = df_pandas['A'] / df_pandas['B']  # Assumes numeric B

# Using .assign()
df_pandas = df_pandas.assign(
    C=df_pandas['A'] * 10,
    D=lambda x: x['A'] / x['B']  # Can use functions
)
```

**Polars**

Uses `with_columns()`, which takes a list of expressions. Each expression typically defines the calculation and uses `.alias()` to name the new/modified column.

```python
# Using with_columns()
df_polars = df_polars.with_columns([
    (pl.col('A') * 10).alias('C'),
    (pl.col('A') / pl.col('B')).alias('D')  # Assumes numeric B
])

# Creating multiple columns, including conditional logic
df_polars = df_polars.with_columns([
    (pl.col('A') * 10).alias('C'),
    pl.when(pl.col('A') > 100)
      .then(pl.lit("High"))      # pl.lit() for literal values
      .otherwise(pl.lit("Low"))
      .alias('A_category')
])
```
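For comparison, the conditional column built with `pl.when().then().otherwise()` above is commonly written in Pandas with `np.where()`. A quick sketch, assuming the same placeholder frame:

```python
import numpy as np

# Pandas counterpart of the pl.when/then/otherwise example above:
# label rows "High" where A exceeds 100, otherwise "Low".
df_pandas['A_category'] = np.where(df_pandas['A'] > 100, 'High', 'Low')

# Series.where()/.mask() are alternatives when the fallback should
# keep the existing values rather than substitute a literal.
```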
### Grouping and Aggregation

Summarizing data by groups is a cornerstone of analysis.

**Pandas**

Uses the `groupby().agg()` pattern, often specifying aggregation functions as strings or using named aggregation.

```python
# Group by 'group_col', calculate the mean of 'A' and the max of 'B'
agg_pandas = df_pandas.groupby('group_col').agg(
    avg_A=('A', 'mean'),
    max_B=('B', 'max')
)
```

**Polars**

Uses a similar `group_by().agg()` structure, but aggregations are defined using expressions.

```python
# Group by 'group_col', calculate the mean of 'A' and the max of 'B'
agg_polars = df_polars.group_by('group_col').agg([
    pl.mean('A').alias('avg_A'),              # Use Polars aggregation functions
    pl.max('B').alias('max_B'),
    pl.count().alias('group_size'),           # Count rows in each group (pl.len() in newer releases)
    pl.first('C').alias('first_C_in_group')   # Get the first value
])
```
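For parity with the extra Polars aggregations, Pandas named aggregation can express the same row count and first value per group. A short sketch under the same hypothetical columns:

```python
# Hypothetical Pandas equivalent of the extra Polars aggregations above
agg_pandas_full = df_pandas.groupby('group_col').agg(
    avg_A=('A', 'mean'),
    max_B=('B', 'max'),
    group_size=('A', 'size'),         # Row count per group
    first_C_in_group=('C', 'first'),  # First value in each group
).reset_index()
```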
### Joining/Merging DataFrames

Combining data from multiple sources based on common keys.

**Pandas**

Uses the `pd.merge()` function or the DataFrame's `.join()` method (which often requires setting the index on one DataFrame).

```python
# Assume df_left_pandas, df_right_pandas exist
merged_pandas = pd.merge(df_left_pandas, df_right_pandas, on='key_col', how='inner')

# Using .join() - often needs index alignment
# joined_pandas = df_left_pandas.join(df_right_pandas.set_index('key_col'), on='key_col', how='left')
```

**Polars**

Uses a single, powerful `.join()` method.

```python
# Assume df_left_polars, df_right_polars exist
joined_polars = df_left_polars.join(df_right_polars, on='key_col', how='inner')

# Different join keys, left join
joined_left_polars = df_left_polars.join(
    df_right_polars,
    left_on='left_key',
    right_on='right_key',
    how='left'
)
```

### Handling Missing Data

Dealing with null or NaN values (requires DataFrames with nulls introduced manually).

**Pandas**

Uses `.isnull()`, `.fillna()`, `.dropna()`.

```python
# Check for nulls
nulls_pandas = df_pandas['A'].isnull()

# Fill nulls
filled_pandas = df_pandas.fillna(0)  # Fill all with 0
filled_specific_pandas = df_pandas.fillna({'A': 0, 'B': 'Unknown'})

# Drop rows with any nulls
dropped_pandas = df_pandas.dropna()
```

**Polars**

Uses the analogous methods `.is_null()`, `.fill_null()`, `.drop_nulls()`.

```python
# Check for nulls (creates a boolean Series/expression)
nulls_polars = df_polars['A'].is_null()
# In an expression context: pl.col('A').is_null()

# Fill nulls
filled_polars = df_polars.fill_null(0)  # Fill all with 0
# Fill specific columns or use strategies (expressions)
filled_strategy_polars = df_polars.with_columns([
    pl.col('A').fill_null(0),
    pl.col('B').fill_null(pl.lit('Unknown')),
    pl.col('C').fill_null(pl.median('C'))  # Fill with the column median
])

# Drop rows with any nulls
dropped_polars = df_polars.drop_nulls()
```

### Applying Custom Functions

For operations not covered by built-in functions, applying custom Python logic.

**Pandas**

Uses `.apply()` (row-wise or column-wise, often slow for rows) or `.map()` (element-wise on a Series).

```python
# Apply row-wise (use cautiously - performance)
df_pandas['custom_result'] = df_pandas.apply(
    lambda row: row['A'] + row['B'] if row['C'] else row['A'], axis=1
)

# Map element-wise
df_pandas['A_mapped'] = df_pandas['A'].map(lambda x: x**2)
```

**Polars**

Provides `.map_elements()` for element-wise operations (faster than row-wise application, but it requires a dtype specification) and row-wise application via `map_rows()` (generally slow, and it breaks the optimizer). Crucially, Polars strongly encourages using its built-in expressions whenever possible for performance.

```python
# *** STRONGLY PREFER BUILT-IN EXPRESSIONS ***
# Example equivalent to the pandas apply lambda using expressions.
# pl.when() needs a boolean, so numeric truthiness becomes an explicit comparison:
df_polars = df_polars.with_columns(
    pl.when(pl.col('C') != 0)
      .then(pl.col('A') + pl.col('B'))
      .otherwise(pl.col('A'))
      .alias('custom_result_expr')
)

# Map element-wise (use only if no expression exists)
df_polars = df_polars.with_columns(
    pl.col('A').map_elements(lambda x: x**2, return_dtype=pl.Float64).alias('A_mapped')
    # Needs return_dtype; potentially slower than expressions
)

# Row-wise application (AVOID if possible - significant performance cost)
# df_polars.map_rows(lambda row: ...)  # Slow! (named .apply() in older Polars versions)
```
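To make the "prefer expressions" advice concrete, here is a rough timing sketch on a synthetic column; absolute numbers will vary by machine and Polars version:

```python
import time
import polars as pl

# Synthetic frame purely for a rough comparison
df = pl.DataFrame({"A": list(range(1_000_000))})

start = time.time()
df.with_columns((pl.col("A") ** 2).alias("A_sq_expr"))  # Built-in expression
expr_time = time.time() - start

start = time.time()
df.with_columns(
    pl.col("A").map_elements(lambda x: x ** 2, return_dtype=pl.Int64).alias("A_sq_udf")
)
udf_time = time.time() - start

# The expression runs in native code; map_elements calls the Python
# lambda once per row, which is typically orders of magnitude slower.
print(f"expression: {expr_time:.4f}s, map_elements: {udf_time:.4f}s")
```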
### Illustrative Benchmark Example (Iowa Sales Data)

This section contains the benchmark code comparing Pandas and Polars on the Iowa Liquor Sales dataset. It reads the data, cleans the 'Sale (Dollars)' column, groups by 'County', and computes the mean sale amount.

Note: Running this requires downloading the dataset (`Iowa_Liquor_Sales.csv`) and placing it at the specified path (`/kaggle/input/iowa-liquor-sales/Iowa_Liquor_Sales.csv`, or adjusting the path). Results depend heavily on your environment. In one shared execution run, Polars completed the task in approximately 9.6 seconds, whereas Pandas took around 108.88 seconds, a dramatic difference for this workload.

**Polars**

```python
import polars as pl
import time
import gc  # Garbage collector

# --- Polars Example: Iowa Liquor Sales Aggregation ---
# Reads the data, cleans the sales column, groups by county, and calculates mean sales.

print("--- Running Polars Iowa Sales Aggregation Example ---")

# Define the path to your data file
# Make sure this path is correct for your environment
csv_file_path = "/kaggle/input/iowa-liquor-sales/Iowa_Liquor_Sales.csv"

# Define the relevant column names
sales_col = 'Sale (Dollars)'
county_col = 'County'
avg_sales_col = 'Average Sale (Dollars)'  # Name for the aggregated column

try:
    # Record the start time
    start_time = time.time()

    # --- Build the Polars lazy query ---
    # 1. Scan the CSV lazily
    lf = pl.scan_csv(csv_file_path)

    # 2. Clean the 'Sale (Dollars)' column and cast to Float64
    #    This expression overwrites the original column
    lf = lf.with_columns(
        pl.col(sales_col)
        .cast(pl.Utf8)            # Ensure it's a string type first
        .str.replace(r"\$", "")   # Remove the '$' character
        .cast(pl.Float64)         # Cast the cleaned string to Float64
    )

    # 3. Group by 'County' and calculate the mean of the cleaned 'Sale (Dollars)'
    lf_agg = lf.group_by(county_col).agg(
        pl.mean(sales_col).alias(avg_sales_col)  # Calculate the mean and rename
    )

    # 4. Execute the entire lazy plan
    result_df = lf_agg.collect()

    # Record the end time
    end_time = time.time()
    duration = end_time - start_time

    # --- Output results ---
    print(f"Polars operation took: {duration:.4f} seconds")
    print("\nAggregation Result (Top 5):")
    print(result_df.head())  # Display the first few rows of the result

    # --- Clean up memory ---
    del lf         # LazyFrame reference
    del lf_agg     # LazyFrame reference
    del result_df  # DataFrame reference
    gc.collect()

# --- Error handling ---
except FileNotFoundError:
    print(f"\nError: Data file not found at {csv_file_path}")
    print("Please ensure the file path is correct.")
except Exception as e:
    # Catch other potential errors during processing (e.g., column not found)
    print(f"\nAn error occurred during Polars processing: {e}")
```

**Pandas**

```python
import pandas as pd
import time
import gc  # Garbage collector

# --- Pandas Example: Iowa Liquor Sales Aggregation ---
# Reads the data, cleans the sales column, groups by county, and calculates mean sales.
# Note: Pandas operations are typically eager.

print("--- Running Pandas Iowa Sales Aggregation Example ---")

# Define the path to your data file
# Make sure this path is correct for your environment
csv_file_path = "/kaggle/input/iowa-liquor-sales/Iowa_Liquor_Sales.csv"

# Define the relevant column names
sales_col = 'Sale (Dollars)'
county_col = 'County'
avg_sales_col = 'Average Sale (Dollars)'  # Name for the aggregated column

try:
    # Record the start time
    start_time = time.time()

    # --- Perform the Pandas operations eagerly ---
    # 1. Read the entire CSV into memory
    df = pd.read_csv(csv_file_path)

    # 2. Clean the 'Sale (Dollars)' column and cast to float
    #    Ensure the column exists before attempting cleaning
    if sales_col in df.columns:
        # Remove '$' and convert to numeric (float)
        # errors='coerce' turns problematic values into NaN
        df[sales_col] = pd.to_numeric(
            df[sales_col].astype(str).str.replace(r'\$', '', regex=True),
            errors='coerce'
        )
        # Alternative using .astype() after the replace:
        # df[sales_col] = df[sales_col].astype(str).str.replace(r'\$', '', regex=True).astype(float)
    else:
        raise ValueError(f"Column '{sales_col}' not found in the CSV.")

    # 3. Group by 'County' and calculate the mean of the cleaned 'Sale (Dollars)'
    #    Ensure the county column exists
    if county_col in df.columns:
        # Group, aggregate, and reset the index to make 'County' a column again
        result_df = df.groupby(county_col)[sales_col].mean().reset_index()
        # Rename the aggregated column for clarity
        result_df = result_df.rename(columns={sales_col: avg_sales_col})
    else:
        raise ValueError(f"Column '{county_col}' not found in the CSV.")

    # Record the end time
    end_time = time.time()
    duration = end_time - start_time

    # --- Output results ---
    print(f"Pandas operation took: {duration:.4f} seconds")
    print("\nAggregation Result (Top 5):")
    print(result_df.head())  # Display the first few rows of the result

    # --- Clean up memory ---
    del df         # DataFrame reference
    del result_df  # DataFrame reference
    gc.collect()

# --- Error handling ---
except FileNotFoundError:
    print(f"\nError: Data file not found at {csv_file_path}")
    print("Please ensure the file path is correct.")
except ValueError as ve:
    # Catch specific errors like missing columns
    print(f"\nData Error: {ve}")
except Exception as e:
    # Catch other potential errors during processing
    print(f"\nAn error occurred during Pandas processing: {e}")
```

This notebook provides a practical reference for the code used in the Pandas vs Polars API comparison. Remember that the best choice of library depends on your specific needs regarding performance, data size, and ecosystem integration.
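On the ecosystem-integration point, the two libraries also convert between each other directly, so they can be mixed within one pipeline. A minimal sketch; note that `to_pandas()` relies on pyarrow being installed:

```python
import pandas as pd
import polars as pl

# Hypothetical round trip between the two libraries
df_pd = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

df_pl = pl.from_pandas(df_pd)   # Pandas -> Polars
df_back = df_pl.to_pandas()     # Polars -> Pandas (requires pyarrow)

print(type(df_pl), type(df_back))
```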