{"id":146,"date":"2018-09-04T01:06:17","date_gmt":"2018-09-04T01:06:17","guid":{"rendered":"http:\/\/thegreenbrain.com.au\/blog\/?p=146"},"modified":"2018-09-27T03:53:27","modified_gmt":"2018-09-27T03:53:27","slug":"using-python-pandas-for-marketing-data-analysis-in-excel","status":"publish","type":"post","link":"https:\/\/thegreenbrain.com.au\/blog\/using-python-pandas-for-marketing-data-analysis-in-excel\/","title":{"rendered":"Using Python Pandas for Marketing Data Analysis in Excel"},"content":{"rendered":"<p>If you are like me who have been working on data analysis using good old Excel, it&#8217;s been getting more difficult as the data size grow. Common tasks like vlookup, filtering and chart plotting become slower to process and hard to use. This is why I turned to Pandas to perform the common tasks in excel and a lot more..<\/p>\n<p>I&#8217;m writing this article to show the codes I used to perform the common tasks in email marketing data analysis. Here, I use soft bounce analysis as an example.<\/p>\n<p>If you are first timer to Python, please read this article on <a href=\"https:\/\/thegreenbrain.com.au\/blog\/install-pandas-for-data-analysis-windows\/\">how to install Python and Pandas on your computer<\/a>.<\/p>\n<p>Common excel tasks<\/p>\n<ol>\n<li>Read data<\/li>\n<li>Group and count data by shared row<\/li>\n<li>VLookup<\/li>\n<li>Filter<\/li>\n<li>Deduplicate<\/li>\n<li>Calculation<\/li>\n<li>Build Graph<\/li>\n<\/ol>\n<p>1.<strong>Reading an Excel file in Python using Pandas<\/strong><\/p>\n<p>To read the excel data, we need to first load the excel file in python using pandas<\/p>\n<p>#import the pandas library and aliasing as pd<br \/>\nimport pandas as pd<\/p>\n<p>#read log data from excel file<br \/>\nfilename = &#8216;APJC-FY18-Bounce history.xlsx&#8217;<br \/>\ndf = pd.read_excel(filename)<\/p>\n<p>xlrd module does not come with Pandas library<\/p>\n<p>Make sure you install the module by running<br \/>\n<code>pip install xlrd<\/code><\/p>\n<p>You will receive the exception below when you try to read an excel file if it&#8217;s not installed<br \/>\n<code>ImportError: No module named 'xlrd'<\/code><\/p>\n<p>Here&#8217;s the sample data of bounceback histrory log<\/p>\n<table style=\"font-size: 12px;border: 1px solid #CCC;font-family: Arial, Helvetica, sans-serif;\">\n<thead>\n<tr>\n<th style=\"background-color: #104E8B;color: #FFF;font-weight: bold;\">Email Address<\/th>\n<th style=\"background-color: #104E8B;color: #FFF;font-weight: bold;\">SMTP Error Code<\/th>\n<th style=\"background-color: #104E8B;color: #FFF;font-weight: bold;\">SMTP Reply Code<\/th>\n<th style=\"background-color: #104E8B;color: #FFF;font-weight: bold;\">Bounceback Message<\/th>\n<th style=\"background-color: #104E8B;color: #FFF;font-weight: bold;\">Bounceback Date\/Time<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>alex@test.com<\/td>\n<td>5.4.14<\/td>\n<td>554<\/td>\n<td>smtp; 5.3.0 &#8211; Other mail system problem 554-&#8216;5.4.14 Hop count exceeded<\/td>\n<td>2\/18\/2018 16:01<\/td>\n<\/tr>\n<tr>\n<td>alex@test.com<\/td>\n<td>5.4.14<\/td>\n<td>554<\/td>\n<td>smtp; 5.3.0 &#8211; Other mail system problem 554-&#8216;5.4.14 Hop count exceeded<\/td>\n<td>3\/18\/2018 17:04<\/td>\n<\/tr>\n<tr>\n<td>alex@test.com<\/td>\n<td>5.4.14<\/td>\n<td>554<\/td>\n<td>smtp; 5.3.0 &#8211; Other mail system problem 554-&#8216;5.4.14 Hop count exceeded<\/td>\n<td>2\/14\/2018 16:10<\/td>\n<\/tr>\n<tr>\n<td>bead@test.com<\/td>\n<td>5.4.1<\/td>\n<td>550<\/td>\n<td>reply smtp; 5.1.0 &#8211; Unknown address error 550-&#8216;5.4.1 <\/td>\n<td>7\/24\/2018 20:00<\/td>\n<\/tr>\n<tr>\n<td>gemma@bc.com<\/td>\n<td>4.4.0<\/td>\n<td>421<\/td>\n<td>[internal] no MXs for this domain could be reached at this time<\/td>\n<td>7\/24\/2018 21:01<\/td>\n<\/tr>\n<tr>\n<td>gemma@bc.com<\/td>\n<td>4.4.0<\/td>\n<td>421<\/td>\n<td>[internal] no MXs for this domain could be reached at this time<\/td>\n<td>7\/24\/2018 21:01<\/td>\n<\/tr>\n<tr>\n<td>gemma@bc.com<\/td>\n<td>4.4.0<\/td>\n<td>421<\/td>\n<td>[internal] no MXs for this domain could be reached at this time<\/td>\n<td>2\/11\/2018 16:01<\/td>\n<\/tr>\n<tr>\n<td>abc@tnb.com<\/td>\n<td>4.4.0<\/td>\n<td>421<\/td>\n<td>[internal] no MXs for this domain could be reached at this time<\/td>\n<td>6\/7\/2018 20:02<\/td>\n<\/tr>\n<tr>\n<td>patrick@itoc.com<\/td>\n<td>4.4.0<\/td>\n<td>421<\/td>\n<td>[internal] no MXs for this domain could be reached at this time<\/td>\n<td>7\/24\/2018 21:01<\/td>\n<\/tr>\n<tr>\n<td>patrick@itoc.com<\/td>\n<td>5.4.14<\/td>\n<td>554<\/td>\n<td>smtp; 5.3.0 &#8211; Other mail system problem 554-&#8216;5.4.14 Hop count exceeded<\/td>\n<td>7\/18\/2018 23:01<\/td>\n<\/tr>\n<tr>\n<td>patrick@itoc.com<\/td>\n<td>5.4.14<\/td>\n<td>554<\/td>\n<td>smtp; 5.3.0 &#8211; Other mail system problem 554-&#8216;5.4.14 Hop count exceeded<\/td>\n<td>4\/8\/2018 17:01<\/td>\n<\/tr>\n<tr>\n<td>abc@tnb.com<\/td>\n<td>4.4.0<\/td>\n<td>421<\/td>\n<td>[internal] no MXs for this domain could be reached at this time<\/td>\n<td>4\/8\/2018 17:01<\/td>\n<\/tr>\n<tr>\n<td>abcd@tnb.com<\/td>\n<td>4.4.0<\/td>\n<td>421<\/td>\n<td>[internal] no MXs for this domain could be reached at this time<\/td>\n<td>4\/8\/2018 17:01<\/td>\n<\/tr>\n<tr>\n<td>abcd@tnb.com<\/td>\n<td>4.4.0<\/td>\n<td>421<\/td>\n<td>[internal] no MXs for this domain could be reached at this time<\/td>\n<td>4\/8\/2018 17:01<\/td>\n<\/tr>\n<tr>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>..<\/td>\n<td>..<\/td>\n<td>..<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><em><strong>What I tried to accomplish<\/strong><\/em><\/p>\n<ol>\n<li>Read the data in excel<\/li>\n<li>Group and count the number of different soft bounces based on their SMTP reply Code for each contact (based on their email addresses). Go to 2<\/li>\n<li>Select only the SMTP reply code with value of 421 and number of bounces are 3 or more from the result above. Go to 3<\/li>\n<li>Use the result at 2. and Vlookup on the original table to get the matches. Go to 4<\/li>\n<li>Deduplicate the same contacts (email addresses) and get the latest bounceback date. Go to 5<\/li>\n<\/ol>\n<p>https:\/\/gist.github.com\/miclow\/c7c5c93fcb09abf0c3c4bc2e0b5923b1<\/p>\n<p>2. <strong>Group and count data by shared row<\/strong><\/p>\n<p>Here I wanted to group and count the number of occurrence based on their <strong>Email Address and SMTP Reply Code.<\/strong> The new column for the count was renamed as <strong>Count of Bounces<\/strong>.<\/p>\n<p>Line 9 <code>df1 = df.groupby(['Email Address','SMTP Reply Code'], as_index=False).size().reset_index(name='Count of Bounces')<\/code><\/p>\n<p>3. <strong>Filter the data<\/strong><\/p>\n<p>One of the common excel tasks is <strong>Data Filtering<\/strong>.<br \/>\nIt&#8217;s super easy to filter and select a subset of data based on certain criteria.<\/p>\n<p>At line 12, I selected only rows with SMTP Reply Code equals to 421<br \/>\nAt line 15, I selected only rows with Count of Bounces equals or more than 3 from the results at line 12<\/p>\n<p>Line 12 <code>df2 = df1.loc[df1['SMTP Reply Code'] == 421]<\/code><br \/>\nLine 15 <code>df3 = df2.loc[df2['Count of Bounces'] >=3 ]<\/code><\/p>\n<p>4. <strong>VLookup<\/strong><\/p>\n<p>Another common excel task is <strong>VLookup<\/strong> by comparing two excel files\/table based on shared values<\/p>\n<p>In this example, I Vlookup on table df (the original table) using table df3 to get all data of shared <strong>Email Address<\/strong><\/p>\n<p>Line 18 <code>df4 = pd.merge(df, df3, on=['Email Address'])<\/code><\/p>\n<p>There are 4 different merge options:<\/p>\n<p>\u2018left\u2019\u200a\u2014\u200aUse the shared column from the left dataframe and match to right dataframe. Fill in any N\/A as NaN<\/p>\n<p>\u2018right\u2019\u200a\u2014\u200aUse the shared column from the right dataframe and match to left dataframe. Fill in any N\/A as NaN<\/p>\n<p>\u2018inner\u2019\u200a\u2014\u200aOnly show data where the two shared columns overlap. Default method.<\/p>\n<p>\u2018outer\u2019\u200a\u2014\u200aReturn all records when there is a match in either left or right dataframe.<\/p>\n<p>5. <strong>Deduplicate<\/strong><\/p>\n<p>Lastly, I wanted to remove the duplicates and keep only the unique <strong>Email Address <\/strong>.<br \/>\nSince the bounceback date was in ascending order, <code>keep=\"last\"<\/code> was used to keep the last row after deduplication. Deduplication will by default keeps the first row if not specified.<\/p>\n<p>Line 21 <code>df5 = df4.drop_duplicates(subset='Email Address', keep=\"last\")<\/code><\/p>\n<p>6. <strong>Calculation<\/strong><\/p>\n<p>7. <strong>Plotting Chart<\/strong><\/p>\n<p>Problems with plotting chart in Excel is that it&#8217;s slow and inflexible. Sometimes, it&#8217;s really hard to translate the visual in your head into Excel.<\/p>\n<p>Pandas allows you to build chart easily.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you are like me who have been working on data analysis using good old Excel, it&#8217;s been getting more difficult as the data size grow. Common tasks like vlookup, filtering and chart plotting become slower to process and hard to use. This is why I turned to Pandas to perform the common tasks in&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-146","post","type-post","status-publish","format-standard","hentry","category-data-analysis"],"_links":{"self":[{"href":"https:\/\/thegreenbrain.com.au\/blog\/wp-json\/wp\/v2\/posts\/146","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/thegreenbrain.com.au\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/thegreenbrain.com.au\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/thegreenbrain.com.au\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/thegreenbrain.com.au\/blog\/wp-json\/wp\/v2\/comments?post=146"}],"version-history":[{"count":51,"href":"https:\/\/thegreenbrain.com.au\/blog\/wp-json\/wp\/v2\/posts\/146\/revisions"}],"predecessor-version":[{"id":218,"href":"https:\/\/thegreenbrain.com.au\/blog\/wp-json\/wp\/v2\/posts\/146\/revisions\/218"}],"wp:attachment":[{"href":"https:\/\/thegreenbrain.com.au\/blog\/wp-json\/wp\/v2\/media?parent=146"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/thegreenbrain.com.au\/blog\/wp-json\/wp\/v2\/categories?post=146"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/thegreenbrain.com.au\/blog\/wp-json\/wp\/v2\/tags?post=146"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}