Threading or Multiprocessing
I have some code that I am trying to accelerate. My goal is to download
and save about a million files, using the requests library to fetch the
content. I am more confused than ever. Most Q&As suggest that the
threading module is the proper tool when a task is I/O bound, and since
I am connecting to a server, waiting for a response, and then writing
the response to disk, my task is I/O bound.
But then I read something like this:
Multiple threads can exist in a single process. The threads that belong to
the same process share the same memory area (can read from and write to
the very same variables, and can interfere with one another).
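That quote is what worries me. A tiny experiment of my own (not from the text I was reading) does show threads stepping on a shared variable:

```python
import threading

counter = 0  # one variable shared by every thread in the process

def bump():
    global counter
    for _ in range(100_000):
        counter += 1  # read-modify-write on shared state, not atomic

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# with no lock, the total can come out below 4 * 100_000
print(counter)
```

So the interference the quote describes is real, at least for a variable that several threads write to.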
My code, before any threading, goes something like this:
import requests

def create_list(some_ftp_site):
    # do some stuff to compare the new listing to
    # the last list and return the difference of the two
    return list_to_pull

def download_and_save_the_file(some_url):
    the_string = requests.get(some_url).content
    file_ref = open(something, 'wb')  # .content is bytes, so write in binary mode
    file_ref.write(the_string)
    file_ref.close()

if __name__ == '__main__':
    files_to_get = create_list(some_ftp_site)
    if len(files_to_get) != 0:
        for file_to_get in files_to_get:
            download_and_save_the_file(file_to_get)
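For what it's worth, the threaded version I have in mind would look roughly like this sketch, using concurrent.futures. The worker body here is a placeholder for the real requests call, and the URLs are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def download_and_save_the_file(some_url):
    # placeholder body; the real version would be:
    #   the_string = requests.get(some_url).content
    #   with open(local_path, 'wb') as file_ref:
    #       file_ref.write(the_string)
    # the_string and file_ref are locals, so each call has its own copies
    return "saved " + some_url

files_to_get = ["ftp://example.com/file%d" % i for i in range(10)]

# a handful of worker threads pull URLs from the list concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(download_and_save_the_file, files_to_get))

print(results[0])
```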
Using either is a jump into the deep end for me. If I multithread this,
I am afraid that something unexpected could happen, for example the
first half of one file concatenated onto the second half of another.
Is this type of task better suited to multiprocessing or multithreading?
Clearly I would not know if two different file parts had been
concatenated, because they are written to the same variable.
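From what I can tell, local variables inside the function are created fresh for each call, so each thread would get its own copy. A small check of my own seems to confirm that:

```python
import threading

results = {}

def worker(name):
    # local_data lives in this call's frame, so each thread has its own copy
    local_data = "data for " + name
    results[name] = local_data

threads = [threading.Thread(target=worker, args=("t%d" % i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results.items()))
```

But I am not confident that this generalizes to my download function, which is why I am asking.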