If you are like me who have been working on data analysis using good old Excel, it’s been getting more difficult as the data size grow. Common tasks like vlookup, filtering and chart plotting become slower to process and hard to use. This is why I turned to Pandas to perform the common tasks in excel and a lot more..
I’m writing this article to show the codes I used to perform the common tasks in email marketing data analysis. Here, I use soft bounce analysis as an example.
If you are first timer to Python, please read this article on how to install Python and Pandas on your computer.
Common excel tasks
- Read data
- Group and count data by shared row
- VLookup
- Filter
- Deduplicate
- Calculation
- Build Graph
1.Reading an Excel file in Python using Pandas
To read the excel data, we need to first load the excel file in python using pandas
#import the pandas library and aliasing as pd
import pandas as pd
#read log data from excel file
filename = ‘APJC-FY18-Bounce history.xlsx’
df = pd.read_excel(filename)
xlrd module does not come with Pandas library
Make sure you install the module by running
pip install xlrd
You will receive the exception below when you try to read an excel file if it’s not installed
ImportError: No module named 'xlrd'
Here’s the sample data of bounceback histrory log
Email Address | SMTP Error Code | SMTP Reply Code | Bounceback Message | Bounceback Date/Time |
---|---|---|---|---|
[email protected] | 5.4.14 | 554 | smtp; 5.3.0 – Other mail system problem 554-‘5.4.14 Hop count exceeded | 2/18/2018 16:01 |
[email protected] | 5.4.14 | 554 | smtp; 5.3.0 – Other mail system problem 554-‘5.4.14 Hop count exceeded | 3/18/2018 17:04 |
[email protected] | 5.4.14 | 554 | smtp; 5.3.0 – Other mail system problem 554-‘5.4.14 Hop count exceeded | 2/14/2018 16:10 |
[email protected] | 5.4.1 | 550 | reply smtp; 5.1.0 – Unknown address error 550-‘5.4.1 | 7/24/2018 20:00 |
[email protected] | 4.4.0 | 421 | [internal] no MXs for this domain could be reached at this time | 7/24/2018 21:01 |
[email protected] | 4.4.0 | 421 | [internal] no MXs for this domain could be reached at this time | 7/24/2018 21:01 |
[email protected] | 4.4.0 | 421 | [internal] no MXs for this domain could be reached at this time | 2/11/2018 16:01 |
[email protected] | 4.4.0 | 421 | [internal] no MXs for this domain could be reached at this time | 6/7/2018 20:02 |
[email protected] | 4.4.0 | 421 | [internal] no MXs for this domain could be reached at this time | 7/24/2018 21:01 |
[email protected] | 5.4.14 | 554 | smtp; 5.3.0 – Other mail system problem 554-‘5.4.14 Hop count exceeded | 7/18/2018 23:01 |
[email protected] | 5.4.14 | 554 | smtp; 5.3.0 – Other mail system problem 554-‘5.4.14 Hop count exceeded | 4/8/2018 17:01 |
[email protected] | 4.4.0 | 421 | [internal] no MXs for this domain could be reached at this time | 4/8/2018 17:01 |
[email protected] | 4.4.0 | 421 | [internal] no MXs for this domain could be reached at this time | 4/8/2018 17:01 |
[email protected] | 4.4.0 | 421 | [internal] no MXs for this domain could be reached at this time | 4/8/2018 17:01 |
… | … | .. | .. | .. |
What I tried to accomplish
- Read the data in excel
- Group and count the number of different soft bounces based on their SMTP reply Code for each contact (based on their email addresses). Go to 2
- Select only the SMTP reply code with value of 421 and number of bounces are 3 or more from the result above. Go to 3
- Use the result at 2. and Vlookup on the original table to get the matches. Go to 4
- Deduplicate the same contacts (email addresses) and get the latest bounceback date. Go to 5
https://gist.github.com/miclow/c7c5c93fcb09abf0c3c4bc2e0b5923b1
2. Group and count data by shared row
Here I wanted to group and count the number of occurrence based on their Email Address and SMTP Reply Code. The new column for the count was renamed as Count of Bounces.
Line 9 df1 = df.groupby(['Email Address','SMTP Reply Code'], as_index=False).size().reset_index(name='Count of Bounces')
3. Filter the data
One of the common excel tasks is Data Filtering.
It’s super easy to filter and select a subset of data based on certain criteria.
At line 12, I selected only rows with SMTP Reply Code equals to 421
At line 15, I selected only rows with Count of Bounces equals or more than 3 from the results at line 12
Line 12 df2 = df1.loc[df1['SMTP Reply Code'] == 421]
Line 15 df3 = df2.loc[df2['Count of Bounces'] >=3 ]
4. VLookup
Another common excel task is VLookup by comparing two excel files/table based on shared values
In this example, I Vlookup on table df (the original table) using table df3 to get all data of shared Email Address
Line 18 df4 = pd.merge(df, df3, on=['Email Address'])
There are 4 different merge options:
‘left’ — Use the shared column from the left dataframe and match to right dataframe. Fill in any N/A as NaN
‘right’ — Use the shared column from the right dataframe and match to left dataframe. Fill in any N/A as NaN
‘inner’ — Only show data where the two shared columns overlap. Default method.
‘outer’ — Return all records when there is a match in either left or right dataframe.
5. Deduplicate
Lastly, I wanted to remove the duplicates and keep only the unique Email Address .
Since the bounceback date was in ascending order, keep="last"
was used to keep the last row after deduplication. Deduplication will by default keeps the first row if not specified.
Line 21 df5 = df4.drop_duplicates(subset='Email Address', keep="last")
6. Calculation
7. Plotting Chart
Problems with plotting chart in Excel is that it’s slow and inflexible. Sometimes, it’s really hard to translate the visual in your head into Excel.
Pandas allows you to build chart easily.