Skip to content

Added 7.Known bugs section and pd.merge bug as 7.1 #9

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
298 changes: 297 additions & 1 deletion just-pandas-things.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -3360,6 +3360,302 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## 7. Known Bugs\n",
"### 7.1 `pd.merge()` joins over `np.nan` and `None`\n",
"\n",
"Joining mutiple dataframes on a common key/column to enrich your existing dataset is a common occurence and is usually done using `pd.merge()` in pandas. \n",
"\n",
"Here, due to current bugs, the merge displays behaviour that would be considered uncharacteristic of a JOIN which cause matches over `np.NaN` and `None` values. This bug is currently tracked by issues [#22491](https://github.com/pandas-dev/pandas/issues/22491) and [#22618](https://github.com/pandas-dev/pandas/issues/22618) in the pandas github repo."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" <th>col2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>B</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>C</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>None</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1 col2\n",
"0 NaN 1\n",
"1 B 2\n",
"2 C 3\n",
"3 None 4"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col11</th>\n",
" <th>col22</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>B</td>\n",
" <td>22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>None</td>\n",
" <td>44</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col11 col22\n",
"0 A 11\n",
"1 B 22\n",
"2 NaN 33\n",
"3 None 44"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" <th>col2</th>\n",
" <th>col11</th>\n",
" <th>col22</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>None</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>None</td>\n",
" <td>4</td>\n",
" <td>NaN</td>\n",
" <td>33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>None</td>\n",
" <td>4</td>\n",
" <td>None</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>B</td>\n",
" <td>2</td>\n",
" <td>B</td>\n",
" <td>22</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1 col2 col11 col22\n",
"0 NaN 1 NaN 33\n",
"1 NaN 1 None 44\n",
"2 None 4 NaN 33\n",
"3 None 4 None 44\n",
"4 B 2 B 22"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"df1 = pd.DataFrame({\"col1\" : [np.nan, \"B\", \"C\", None], \"col2\" : [1, 2, 3, 4]})\n",
"display(df1)\n",
"df2 = pd.DataFrame({\"col11\" : [\"A\", \"B\", np.nan, None], \"col22\" : [11, 22, 33, 44]})\n",
"display(df2)\n",
"pd.merge(df1, df2, left_on=\"col1\" , right_on=\"col11\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If presence of these `NULL` values is not caught in the initial EDA and the bug is not known, the post-merge analysis will likely be faulty. One quick way to skip around this bug would be to drop the values as follows before the merge."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" <th>col2</th>\n",
" <th>col11</th>\n",
" <th>col22</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>B</td>\n",
" <td>2</td>\n",
" <td>B</td>\n",
" <td>22</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1 col2 col11 col22\n",
"0 B 2 B 22"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df1.dropna(subset=['col1']), df2.dropna(subset=['col11']), left_on='col1' , right_on='col11')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"Oh, and this is [the analysis I did based on this data](https://huyenchip.com/2019/08/21/glassdoor-interview-reviews-tech-hiring-cultures.html), in case you're interested!"
]
}
Expand All @@ -3380,7 +3676,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
"version": "3.8.3"
},
"varInspector": {
"cols": {
Expand Down