PyScopus: An example of author disambiguation

Scopus sometimes merges people with similar names into a single author profile. I recently came up with a fairly simple method for cleaning author publication profiles, although it requires some manual work.

If you can think of a better way, please do let me know!

In [3]:
import pyscopus
In [4]:
from pyscopus import Scopus
scopus = Scopus(key)

Wrapper functions for disambiguation

In [5]:
import requests, time
import pandas as pd
from bs4 import BeautifulSoup as Soup
In [6]:
def _check_pub_validity(sid, author_id, author_affil_id_list, apikey):
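    ## BASE_URL must be defined at module level before this function is called.
    ## It is assumed to be the Scopus abstract retrieval endpoint, e.g.
    ## 'https://api.elsevier.com/content/abstract/scopus_id/'
    global BASE_URL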
    r = requests.get(BASE_URL+sid, params={'apikey': apikey})
    soup = Soup(r.content, 'lxml')
    author_list = soup.find('authors').find_all('author')

    ## go through the author list to find the author first, by matching author id
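    for au in author_list:
        if au['auid'] == author_id:
            ## find it and break
            break
    else:
        ## the focal author is not listed on this paper, so it cannot be validated
        return False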

    ## check the affiliation id: note that an author may have a list of affiliations
    this_affil_id_list = [affil_tag['id'] for affil_tag in au.find_all('affiliation')]
    ## get the affiliation id and check if there are any overlap
    if len(set(author_affil_id_list).intersection(set(this_affil_id_list))) > 0:
        return True
    return False
In [7]:
def check_pub_validity(scopus_obj, author_id, author_affil_id_list, apikey):
    ## first find out all pub
    pub_df = scopus_obj.search_author_publication(author_id)
    ## do this for all non-null scopus ids
    pub_df = pub_df[~pub_df.scopus_id.isnull()]
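    ## list to save all eligible scopus ids
    eligible_scopus_id_list = list()
    for i, sid in enumerate(pub_df.scopus_id.values):
        if (i+1)%5==0:
            ## pause briefly every 5 requests to go easy on the API
            ## (the 1-second pause length is an assumption)
            time.sleep(1)
        if _check_pub_validity(sid, author_id, author_affil_id_list, apikey):
            ## if true, save it
            eligible_scopus_id_list.append(sid)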
    ## finally, get a subset of the original pub_df
    filtered_pub_df = pub_df.query("scopus_id in @eligible_scopus_id_list")
    return filtered_pub_df

When I was collecting data for my own research, I found that Dr. Vivek K. Singh has a very noisy profile in Scopus. Let's use this as an example.

The basic idea is to match author-affiliation pairs:

  • For each paper found in the mixed profile
    • Find the focal author (in this case, Dr. Singh)
    • Look at his/her affiliation on that paper
      • Keep the paper if the affiliation is one where he/she has actually worked
      • If not, discard the paper
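
The steps above can be sketched without any API calls. This is only an illustration of the matching rule, not the code used later in this notebook: the function name `paper_is_valid`, the paper ids, the author ids, and the dictionary layout are all hypothetical.

```python
def paper_is_valid(paper_authors, author_id, known_affil_ids):
    """Return True if the focal author appears on the paper with a known affiliation.

    paper_authors: list of dicts with 'auid' and 'affil_ids' keys
    (a hypothetical, simplified stand-in for the Scopus author list).
    """
    for author in paper_authors:
        if author["auid"] == author_id:
            # an author may have several affiliations; any overlap is enough
            return bool(set(author["affil_ids"]) & set(known_affil_ids))
    # focal author not listed on this paper
    return False


# toy data: three papers attributed to the (made-up) author id "111"
papers = {
    "paper-A": [{"auid": "111", "affil_ids": ["60030623"]},
                {"auid": "222", "affil_ids": ["99999999"]}],
    "paper-B": [{"auid": "111", "affil_ids": ["12345678"]}],
    "paper-C": [{"auid": "333", "affil_ids": ["60030623"]}],
}
known = ["60030623", "60022195", "60007278"]

kept = [pid for pid, authors in papers.items()
        if paper_is_valid(authors, "111", known)]
print(kept)
```

Only `paper-A` survives: `paper-B` lists the author under an unknown affiliation, and `paper-C` has a matching affiliation but belongs to a different author.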

For Dr. Singh, I obtained his affiliation ids manually through the Scopus affiliation search. With those in hand, create a dictionary containing his name (first/last), an affiliation name, and a list of affiliation ids. The author and affiliation names are used to search for the author; the list of affiliation ids is used to clean the papers:

  • UC Irvine 60007278
  • MIT 60022195
  • Rutgers 60030623
In [8]:
d = {'authfirst': 'Vivek', 'authlastname': 'Singh', 'affiliation': 'Rutgers',
     'affil_id_list': ['60030623', '60022195', '60007278']}
d
{'authfirst': 'Vivek',
 'authlastname': 'Singh',
 'affiliation': 'Rutgers',
 'affil_id_list': ['60030623', '60022195', '60007278']}
In [9]:
query = "AUTHLASTNAME({}) and AUTHFIRST({}) and AFFIL({})".format(d['authlastname'], d['authfirst'], d['affiliation'])
author_search_df = scopus.search_author(query)
author_search_df
    author_id                  name  document_count                        affiliation  affiliation_id
0  7404651152  Vivek Kumar N. Singh             491  Shri Mata Vaishno Devi University        60017187

A search sometimes returns several author profiles for one person. Here we get only one, and it is clearly very noisy: it lists 491 documents under an affiliation that is not among the three above.

In the following step, I use the helper functions defined above to screen each paper under this author_id.

In [10]:
author_id = '7404651152'
author_id, d['affil_id_list']
('7404651152', ['60030623', '60022195', '60007278'])

The filtering process may take a while, depending on how many documents are mixed up, because each paper requires its own API request.

In [12]:
filterd_pub_df = check_pub_validity(scopus, author_id, d['affil_id_list'], key)
filterd_pub_df.shape[0], filterd_pub_df.scopus_id.unique().size, filterd_pub_df.scopus_id.isnull().sum()
(134, 134, 0)

The number of papers drops sharply, from the 491 documents listed on the profile to 134. We can now check a random subset to see whether the filtered papers make sense for this author.

In [13]:
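import numpy as np
## draw 20 random row positions; np.random.randint samples with replacement
filterd_pub_df.iloc[np.random.randint(0, high=134, size=20)][['title', 'publication_name']]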
title publication_name
118 Effects of high-energy irradiation on silicon ... Optics InfoBase Conference Papers
95 Physical-Cyber-Social Computing: Looking Back,... IEEE Internet Computing
141 Low-stress silicon nitride for mid-infrared mi... Optics InfoBase Conference Papers
194 Mid-infrared silicon waveguide resonators with... Materials Research Society Symposium Proceedings
67 Effects of high-energy irradiation on silicon ... Optics InfoBase Conference Papers
194 Mid-infrared silicon waveguide resonators with... Materials Research Society Symposium Proceedings
89 Preface Geo-Intelligence and Visualization through Big...
179 Demonstration of high-Q mid-infrared chalcogen... Optics Letters
147 Low-Stress silicon nitride platform for broadb... Optics InfoBase Conference Papers
309 Situation based control for cyber-physical env... Proceedings - IEEE Military Communications Con...
42 Towards measuring fine-grained diversity using... Proceedings of the 11th International Conferen...
306 Motivating contributors in social media networks 1st ACM SIGMM International Workshop on Social...
146 Mid-infrared opto-nanofluidics for label-free ... Optics InfoBase Conference Papers
54 Gradient Polymer Nanofoams for Encrypted Recor... ACS Nano
95 Physical-Cyber-Social Computing: Looking Back,... IEEE Internet Computing
336 Towards environment-to-environment (E2E) multi... MM'08 - Proceedings of the 2008 ACM Internatio...
144 Low-stress silicon nitride platform for broadb... Conference on Lasers and Electro-Optics Europe...
209 Anisotropic photoluminescence from Er-TeO<inf>... CLEO: Science and Innovations, CLEO_SI 2012
95 Physical-Cyber-Social Computing: Looking Back,... IEEE Internet Computing
71 On-chip mid-infrared gas detection using chalc... Applied Physics Letters
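
The sample above contains repeated rows (index 95 appears three times), which suggests the positions were drawn with replacement. A minimal sketch of drawing distinct positions instead, using only the standard library; the helper name `sample_positions` is hypothetical.

```python
import random


def sample_positions(n_rows, size, seed=None):
    """Return up to `size` distinct row positions in [0, n_rows), sorted."""
    rng = random.Random(seed)
    size = min(size, n_rows)  # cannot draw more distinct rows than exist
    return sorted(rng.sample(range(n_rows), size))


positions = sample_positions(134, 20, seed=0)
print(len(positions), len(set(positions)))
```

The resulting list can be passed to `filterd_pub_df.iloc[positions]`. pandas also offers `DataFrame.sample(n=20)`, which samples without replacement by default.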

However, there may still be noise in it (e.g., papers published in optics/photonics venues). We can manually exclude those as well:

In [14]:
filterd_pub_df = filterd_pub_df.query("not publication_name.str.lower().str.contains('optic')")
filterd_pub_df = filterd_pub_df.query("not publication_name.str.lower().str.contains('photonic')")
filterd_pub_df = filterd_pub_df.query("not publication_name.str.lower().str.contains('nano')")
filterd_pub_df = filterd_pub_df.query("not publication_name.str.lower().str.contains('quantum')")
filterd_pub_df = filterd_pub_df.query("not publication_name.str.lower().str.contains('sensor')")
filterd_pub_df = filterd_pub_df.query("not publication_name.str.lower().str.contains('cleo')")
filterd_pub_df = filterd_pub_df.query("not publication_name.str.lower().str.contains('materials')")
filterd_pub_df = filterd_pub_df.query("not publication_name.str.lower().str.contains('physics')")
filterd_pub_df = filterd_pub_df.query("not publication_name.str.lower().str.contains('chip')")
filterd_pub_df.shape
(59, 16)
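
The same exclusion rule can be expressed once with a keyword list, which is easier to extend than a chain of queries. This is a sketch on plain strings rather than the DataFrame; `EXCLUDE_KEYWORDS` and `is_off_topic` are hypothetical names, and the venue list is a small hand-picked subset of the sample output.

```python
# keywords taken from the queries above
EXCLUDE_KEYWORDS = ["optic", "photonic", "nano", "quantum", "sensor",
                    "cleo", "materials", "physics", "chip"]


def is_off_topic(venue, keywords=EXCLUDE_KEYWORDS):
    """Return True if the venue name contains any excluded keyword (case-insensitive)."""
    venue = (venue or "").lower()  # treat a missing venue name as an empty string
    return any(keyword in venue for keyword in keywords)


venues = ["Optics Letters", "IEEE Internet Computing", "ACS Nano",
          "Applied Physics Letters", "IEEE Multimedia"]
kept_venues = [v for v in venues if not is_off_topic(v)]
print(kept_venues)
```

On the DataFrame, the equivalent filter would be `filterd_pub_df[~filterd_pub_df.publication_name.map(is_off_topic)]`. Substring matching is blunt, so check that a keyword such as 'chip' or 'sensor' does not also remove venues you want to keep.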

Let's check again:

In [15]:
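import numpy as np
## draw 20 random row positions; np.random.randint samples with replacement
filterd_pub_df.iloc[np.random.randint(0, high=59, size=20)][['title', 'publication_name']]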
title publication_name
103 Assessing personality using demographic inform... ACM International Conference Proceeding Series
44 Examining information search behaviors in smal... Proceedings of the Association for Information...
44 Examining information search behaviors in smal... Proceedings of the Association for Information...
18 Social bridges in urban purchase behavior ACM Transactions on Intelligent Systems and Te...
62 If it looks like a spammer and behaves like a ... International Journal of Information Security
152 EventShop: Recognizing situations in web data ... WWW 2013 Companion - Proceedings of the 22nd I...
5 Are you altruistic? Your mobile phone could tell 2017 IEEE SmartWorld Ubiquitous Intelligence a...
103 Assessing personality using demographic inform... ACM International Conference Proceeding Series
247 EventShop: From heterogeneous web streams to p... Proceedings of the 4th Annual ACM Web Science ...
61 LTA 2016 - The first workshop on lifelogging t... MM 2016 - Proceedings of the 2016 ACM Multimed...
34 Effect of gamma exposure on chalcogenide glass... IEEE Radiation Effects Data Workshop
95 Physical-Cyber-Social Computing: Looking Back,... IEEE Internet Computing
298 Structural analysis of the emerging event-web Proceedings of the 19th International Conferen...
17 New Signals in Multimedia Systems and Applicat... IEEE Multimedia
152 EventShop: Recognizing situations in web data ... WWW 2013 Companion - Proceedings of the 22nd I...
317 Adversary aware surveillance systems IEEE Transactions on Information Forensics and...
30 Toward harmonizing self-reported and logged so... Conference on Human Factors in Computing Syste...
74 Predicting privacy attitudes using phone metadata Lecture Notes in Computer Science (including s...
89 Preface Geo-Intelligence and Visualization through Big...
64 Probing the interconnections between geo-explo... UbiComp 2016 - Proceedings of the 2016 ACM Int...

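The list is now much cleaner, and we can use it for this focal author. A quick manual pass is still worthwhile, since an occasional paper that is likely off-topic (such as the chalcogenide glass paper in the sample above) can slip through.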